Gaussian Process



By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. Gaussian Random Vectors


Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, where $\Sigma = \Sigma^T$ and $\Sigma \succ 0$.



$$p(x) = \frac{1}{(2\pi)^{n/2}\,(\det \Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$


1.1. The marginal pdf of a Gaussian (is Gaussian)


Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, and


$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$


Let's look at the component $x_1 = \begin{bmatrix} I & 0 \end{bmatrix} x$

  • mean of $x_1$


$$\mathbb{E}\, x_1 = \begin{bmatrix} I & 0 \end{bmatrix} \mu = \mu_1$$


  • covariance of $x_1$


$$\operatorname{cov}(x_1) = \begin{bmatrix} I & 0 \end{bmatrix} \Sigma \begin{bmatrix} I \\ 0 \end{bmatrix} = \Sigma_{11}$$

  • In fact, the random vector $x_1$ is Gaussian (this is not obvious).


1.2. Linear transformation of Gaussian (is Gaussian)


Suppose $x \sim \mathcal{N}(\mu_x, \Sigma_x)$. Consider the linear function of $x$


$$y = Ax + b$$


  • We already know how means and covariances transform. We have


$$\mathbb{E}\, y = A\, \mathbb{E}\, x + b, \qquad \operatorname{cov}(y) = A \operatorname{cov}(x)\, A^T$$


$$\mu_y = A\mu_x + b, \qquad \Sigma_y = A \Sigma_x A^T$$


  • The amazing fact is that y is Gaussian.



1.3. Components of a Gaussian random vector (is Gaussian)


Suppose $x \sim \mathcal{N}(0, \Sigma)$, and let $c \in \mathbb{R}^n$ be a unit vector.

Let $y = c^T x$

  • $y$ is the component of $x$ in the direction $c$
  • $y$ is Gaussian, with $\mathbb{E}\, y = 0$ and $\operatorname{cov}(y) = c^T \Sigma c$
  • So $\mathbb{E}(y^2) = c^T \Sigma c$
  • (PCA) The unit vector $c$ that maximizes $c^T \Sigma c$ is the eigenvector of $\Sigma$ with the largest eigenvalue. Then


$$\mathbb{E}(y^2) = \lambda_{\max}$$
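
As a quick numerical check of this fact (a sketch that is not part of the original derivation; the covariance matrix below is arbitrary), we can sample $x \sim \mathcal{N}(0, \Sigma)$, project onto the top eigenvector of $\Sigma$, and compare the empirical variance with $\lambda_{\max}$:

In [ ]:
import numpy as np

np.random.seed(0)

# an arbitrary symmetric positive definite covariance matrix
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# eigenvector of Sigma with the largest eigenvalue (eigh sorts eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(Sigma)
lam_max = eigvals[-1]
c = eigvecs[:, -1]                          # unit vector maximizing c^T Sigma c

# sample x ~ N(0, Sigma) and take the component y = c^T x
X = np.random.multivariate_normal(np.zeros(2), Sigma, size=100000)
y = X @ c

print(np.var(y), c @ Sigma @ c, lam_max)    # all approximately equal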


1.4. Conditional pdf for a Gaussian (is Gaussian)


Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, and


$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$


Suppose we measure $x_2 = y$. We would like to find the conditional pdf of $x_1$ given $x_2 = y$.

  • Is it Gaussian?
  • What is the conditional mean $\mathbb{E}(x_1 \mid x_2 = y)$?
  • What is the conditional covariance $\operatorname{cov}(x_1 \mid x_2 = y)$?

By the completion of squares formula


$$\Sigma^{-1} = \begin{bmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix} \begin{bmatrix} \left(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{bmatrix} \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix}$$



If $x \sim \mathcal{N}(0, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian.

  • The conditional mean is


$$\mathbb{E}(x_1 \mid x_2 = y) = \Sigma_{12}\Sigma_{22}^{-1}\, y$$


It is a linear function of y.

  • The conditional covariance is


$$\operatorname{cov}(x_1 \mid x_2 = y) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \leq \Sigma_{11}$$


It is not a function of y. Instead, it is constant.

Conditional confidence intervals are narrower, i.e., measuring $x_2$ gives information about $x_1$.




If $x \sim \mathcal{N}(\mu, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian:


$$p(x_1 \mid x_2 = y) = \mathcal{N}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$$
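
To make the conditioning formula concrete, here is a small numerical sketch (the mean and covariance below are arbitrary): compare the closed-form conditional mean and variance with a Monte Carlo estimate obtained by keeping only samples whose $x_2$ is close to the measured value.

In [ ]:
import numpy as np

np.random.seed(1)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
y = 0.5                                        # measured value of x2

# conditional mean and variance from the formula
m_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y - mu[1])
v_cond = Sigma[0, 0] - Sigma[0, 1]**2 / Sigma[1, 1]

# Monte Carlo check: keep samples whose x2 falls in a narrow window around y
X = np.random.multivariate_normal(mu, Sigma, size=2000000)
sel = np.abs(X[:, 1] - y) < 0.01
print(m_cond, X[sel, 0].mean())                # conditional mean
print(v_cond, X[sel, 0].var())                 # conditional variance (smaller than Sigma[0, 0])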

1.5. Summary

  • Closure under multiplication
    • the product of Gaussian factors is (proportional to) a Gaussian



  • Closure under linear maps
    • linear maps of Gaussians are Gaussians



  • Closure under marginalization
    • projections of Gaussians are Gaussian



  • Closure under conditioning
    • cuts through Gaussians are Gaussians


2. Linear Model


$$y = Ax + \omega$$


  • $x$ and $\omega$ are independent
  • We have induced pdfs $p_x$ and $p_\omega$

Estimation problem: we measure $y = y_{\text{meas}}$ and would like to estimate $x$.


$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\ A & I \end{bmatrix} \begin{bmatrix} x \\ \omega \end{bmatrix}$$


  • We will measure $y = y_{\text{meas}}$ and estimate $x$
  • To do this, we would like the conditional pdf of $x \mid y = y_{\text{meas}}$
  • For this, we need the joint pdf of $x$ and $y$

2.1. Linear measurements with Gaussian noise


$$y = Ax + \omega$$


  • Suppose $x \sim \mathcal{N}(0, \Sigma_x)$ and $\omega \sim \mathcal{N}(0, \Sigma_\omega)$
  • So $\begin{bmatrix} x \\ y \end{bmatrix}$ is Gaussian with mean and covariance


$$\mathbb{E}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \operatorname{cov}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \Sigma_x & \Sigma_x A^T \\ A\Sigma_x & A\Sigma_x A^T + \Sigma_\omega \end{bmatrix}$$


The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is


$$\hat{x}_{\text{mmse}} = \mathbb{E}(x \mid y = y_{\text{meas}}) = \Sigma_{12}\Sigma_{22}^{-1}\, y_{\text{meas}} = \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} y_{\text{meas}}$$


The posterior covariance of $x$ given $y = y_{\text{meas}}$ is


$$\operatorname{cov}(x \mid y = y_{\text{meas}}) = \Sigma_x - \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} A\Sigma_x \leq \Sigma_x$$


Example: linear measurements with Gaussian noise
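
The original slides illustrate this with a worked example; as a stand-in, here is a minimal numerical sketch in which $A$, $\Sigma_x$, and $\Sigma_\omega$ are chosen arbitrarily:

In [ ]:
import numpy as np

np.random.seed(2)

n, m = 4, 6                                   # x in R^4, y in R^6
A = np.random.randn(m, n)
Sigma_x = np.eye(n)                           # prior covariance of x
Sigma_w = 0.1 * np.eye(m)                     # noise covariance

# simulate one measurement y = A x + w
x_true = np.random.multivariate_normal(np.zeros(n), Sigma_x)
w = np.random.multivariate_normal(np.zeros(m), Sigma_w)
y_meas = A @ x_true + w

# MMSE estimate and posterior covariance
G = Sigma_x @ A.T @ np.linalg.inv(A @ Sigma_x @ A.T + Sigma_w)
x_mmse = G @ y_meas
Sigma_post = Sigma_x - G @ A @ Sigma_x

print(x_true)
print(x_mmse)                                 # close to x_true
print(np.diag(Sigma_post))                    # posterior variances are below the prior variances (= 1)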


2.1.1. Signal-to-noise ratio



2.1.2. Matrix equality


$$\operatorname{cov}(x \mid y = y_{\text{meas}}) = \Sigma_x - \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} A\Sigma_x = \left(\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A\right)^{-1}$$


$$L = \Sigma_{12}\Sigma_{22}^{-1} = \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} = \left(\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A\right)^{-1} A^T \Sigma_\omega^{-1}$$
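
Both identities are easy to verify numerically on random matrices (a sketch using the same $\Sigma_x$, $A$, $\Sigma_\omega$ notation; the particular values are arbitrary):

In [ ]:
import numpy as np

np.random.seed(3)

n, m = 3, 5
A = np.random.randn(m, n)
Sigma_x = np.diag([1.0, 2.0, 0.5])            # any positive definite prior covariance
Sigma_w = 0.2 * np.eye(m)                     # noise covariance

inv = np.linalg.inv

# covariance form vs. information form of the posterior covariance
cov1 = Sigma_x - Sigma_x @ A.T @ inv(A @ Sigma_x @ A.T + Sigma_w) @ A @ Sigma_x
cov2 = inv(inv(Sigma_x) + A.T @ inv(Sigma_w) @ A)

# the two expressions for the estimator gain L
L1 = Sigma_x @ A.T @ inv(A @ Sigma_x @ A.T + Sigma_w)
L2 = inv(inv(Sigma_x) + A.T @ inv(Sigma_w) @ A) @ A.T @ inv(Sigma_w)

print(np.allclose(cov1, cov2), np.allclose(L1, L2))   # True True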


2.1.3. Non-zero means

Suppose $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $\omega \sim \mathcal{N}(\mu_\omega, \Sigma_\omega)$.

The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is


$$\hat{x}_{\text{mmse}} = \mu_x + \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1}\left(y_{\text{meas}} - A\mu_x - \mu_\omega\right)$$

2.2. MMSE and Least-Squares

For Gaussians, the MMSE estimate is equal to the MAP estimate.

The least-squares approach minimizes


$$\lVert y - Ax \rVert^2 = \sum_{i=1}^{m}\left(y_i - a_i^T x\right)^2$$

where $A = \begin{bmatrix} a_1 & a_2 & \cdots & a_m \end{bmatrix}^T$


2.3. Weighted norms

Suppose instead we minimize


$$\sum_{i=1}^{m} \omega_i \left(y_i - a_i^T x\right)^2$$

where $\omega_1, \omega_2, \ldots, \omega_m$ provide the weights.



2.3.1. Weighted Least-Squares


$$\underset{x}{\text{minimize}} \quad \lVert y_{\text{meas}} - Ax \rVert_W$$


Then the optimum $x_{\text{wls}}$ is


$$x_{\text{wls}} = \left(A^T W A\right)^{-1} A^T W\, y_{\text{meas}}$$


  • $Ax_{\text{wls}}$ is the closest point (in the weighted norm) in $\operatorname{range}(A)$ to $y_{\text{meas}}$



2.3.2. MMSE and Weighted Least-Squares

Suppose we choose the weight $W = \Sigma_\omega^{-1}$.

The WLS solution is


$$x_{\text{wls}} = \left(A^T \Sigma_\omega^{-1} A\right)^{-1} A^T \Sigma_\omega^{-1}\, y_{\text{meas}}$$


The MMSE estimate is


$$x_{\text{mmse}} = \left(\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A\right)^{-1} A^T \Sigma_\omega^{-1}\, y_{\text{meas}}$$


  • As the prior covariance $\Sigma_x \to \infty$, the MMSE estimate tends to the WLS estimate (see the numerical sketch below)
  • If $\Sigma_\omega = I$, then the MMSE estimate tends to the usual least-squares solution as $\Sigma_x \to \infty$
  • The weighted norm heavily penalizes the residual $y - Ax$ in low-noise directions
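
A short numerical sketch of the first point (the data and noise levels below are arbitrary): with $W = \Sigma_\omega^{-1}$, the MMSE estimate approaches the WLS estimate as the prior covariance $\Sigma_x$ grows.

In [ ]:
import numpy as np

np.random.seed(4)

n, m = 3, 8
A = np.random.randn(m, n)
Sigma_w = np.diag(np.random.uniform(0.1, 1.0, m))     # heteroscedastic noise
W = np.linalg.inv(Sigma_w)

x_true = np.array([1.0, -2.0, 0.5])
y_meas = A @ x_true + np.random.multivariate_normal(np.zeros(m), Sigma_w)

inv = np.linalg.inv
x_wls = inv(A.T @ W @ A) @ A.T @ W @ y_meas

for scale in [0.1, 1.0, 10.0, 1000.0]:                # prior covariance Sigma_x = scale * I
    Sigma_x_inv = np.eye(n) / scale
    x_mmse = inv(Sigma_x_inv + A.T @ W @ A) @ A.T @ W @ y_meas
    print(scale, np.linalg.norm(x_mmse - x_wls))      # the gap shrinks as scale grows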

3. Gaussian Process Regression

3.1. Bayesian linear regression

  • Why not use MLE? It overfits.

  • Why not use MAP? It gives no representation of uncertainty.

  • Why Bayesian? We obtain the full predictive distribution, which lets us quantify uncertainty and minimize the expected loss.

It gives us $p(y \mid x, \mathcal{D})$. This is what we really want.

Motivation

In [1]:
%%html
<center><iframe src="https://www.youtube.com/embed/dtkGq9tdYcI"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Weight-space view



The function-space view will be discussed in the following section.

3.2. Gaussian process regression


$$\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad x_i \in \mathbb{R}^d, \;\; y_i \in \mathbb{R}$$


$$p(y_i \mid x_i, \omega) = \mathcal{N}\left(y_i \mid \omega^T x_i, \sigma^2\right), \qquad \omega \sim \mathcal{N}(0, \gamma I)$$


$$y_i = \omega^T x_i + \varepsilon_i$$


For any $x \in \mathbb{R}^d$, $\omega^T x$ is univariate Gaussian.


Then $Z_x = x^T \omega$ is a GP on $S = \mathbb{R}^d$ with


$$\mu(x) = \mathbb{E}\, Z_x = 0$$


$$k(x, x') = \operatorname{cov}(Z_x, Z_{x'}) = \mathbb{E}\, Z_x Z_{x'} - \mathbb{E}\, Z_x\, \mathbb{E}\, Z_{x'} = \mathbb{E}\, x^T \omega\, \omega^T x' = x^T\, \mathbb{E}\left(\omega \omega^T\right) x' = \gamma\, x^T x'$$
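
A quick sampling check of this covariance (a sketch; $\gamma$ and the two inputs below are arbitrary): draw many $\omega \sim \mathcal{N}(0, \gamma I)$ and compare the empirical covariance of $Z_x$ and $Z_{x'}$ with $\gamma\, x^T x'$.

In [ ]:
import numpy as np

np.random.seed(5)

d, gamma = 3, 2.0
x  = np.array([1.0, 0.5, -1.0])
xp = np.array([0.2, -0.3, 0.7])

# draw many weight vectors w ~ N(0, gamma I)
W = np.random.multivariate_normal(np.zeros(d), gamma * np.eye(d), size=200000)

Zx, Zxp = W @ x, W @ xp                       # Z_x = x^T w for each sampled w
print(np.cov(Zx, Zxp)[0, 1])                  # empirical covariance
print(gamma * x @ xp)                         # k(x, x') = gamma * x^T x'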


With non-linear basis

  • $\tilde{x}_i = \varphi(x_i)$

Generally


$$y_i = Z_{x_i} + \varepsilon_i$$
In [2]:
%%html
<center><iframe src="https://www.youtube.com/embed/upJ31CIVWZo" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

The key step of GP regression


$$y_i = Z_{x_i} + \varepsilon_i$$


Let $z \sim \mathcal{N}(\mu, K) \in \mathbb{R}^n$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I) \in \mathbb{R}^n$ be independent.


$$y = z + \varepsilon \sim \mathcal{N}\left(\mu,\; \underbrace{K + \sigma^2 I}_{C}\right)$$


Partition the indices as $a = [1, \ldots, l]^T$ and $b = [l+1, \ldots, n]^T$, so that


$$y = \begin{bmatrix} y_a \\ y_b \end{bmatrix}$$


$$\mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \qquad C = \begin{bmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb} \end{bmatrix}, \qquad K = \begin{bmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{bmatrix}$$


Conditional distribution: $p(y_a \mid y_b) = \mathcal{N}(m, D)$,

where


$$m = \mu_a + C_{ab} C_{bb}^{-1}\left(y_b - \mu_b\right), \qquad D = C_{aa} - C_{ab} C_{bb}^{-1} C_{ba}$$


$$m = \mu_a + K_{ab}\left(K_{bb} + \sigma^2 I\right)^{-1}\left(y_b - \mu_b\right), \qquad D = \left(K_{aa} + \sigma^2 I\right) - K_{ab}\left(K_{bb} + \sigma^2 I\right)^{-1} K_{ba}$$
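
These two formulas are all that GP regression needs: with $b$ indexing the (noisy) training observations and $a$ the test points, $m$ is the posterior mean and $D$ the posterior covariance. A minimal sketch implementing them directly, with $\mu = 0$, a squared exponential kernel, and illustrative parameter values (the helper `sqexp_kernel` is introduced here only for this example):

In [ ]:
import numpy as np

def sqexp_kernel(a, b, theta1=1.0, theta2=2.0):
    # k(x, x') = theta1 * exp(-theta2/2 * (x - x')^2)
    return theta1 * np.exp(-theta2 / 2 * np.subtract.outer(a, b)**2)

np.random.seed(6)

# noisy training data  y_i = Z_{x_i} + eps_i
sigma2 = 0.01
xb = np.linspace(0, 5, 10)
yb = np.sin(xb) + np.sqrt(sigma2) * np.random.randn(len(xb))

# test inputs (the "a" block); training inputs are the "b" block
xa = np.linspace(0, 5, 200)

Kab = sqexp_kernel(xa, xb)
Kbb = sqexp_kernel(xb, xb)
Kaa = sqexp_kernel(xa, xa)

Cbb_inv = np.linalg.inv(Kbb + sigma2 * np.eye(len(xb)))
m = Kab @ Cbb_inv @ yb                                        # posterior mean (mu = 0)
D = (Kaa + sigma2 * np.eye(len(xa))) - Kab @ Cbb_inv @ Kab.T  # posterior (predictive) covariance of y_a

print(m[:5])
print(np.sqrt(np.diag(D))[:5])                                # pointwise predictive standard deviation
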
In [3]:
%%html
<center><iframe src="https://www.youtube.com/embed/UH1d2mxwet8" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [4]:
%%html
<center><iframe src="https://www.youtube.com/embed/JdZr74mtZkU" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Function-space view

  • Squared exponential (~RBF) kernel



Gaussian Processes: From the Basics to the State-of-the-Art, Imperial's ML tutorial series

In [5]:
%%html
<center><iframe src="https://www.youtube.com/embed/92-98SYOdlY" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

4. GP Regression with Python

We can describe a Gaussian process as a distribution over functions. Just as a multivariate normal distribution is completely specified by a mean vector and covariance matrix, a GP is fully specified by a mean function and a covariance function:


$$p(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right)$$


For example, one specification of a GP might be:


$$m(x) = 0, \qquad k(x, x') = \theta_1 \exp\left(-\frac{\theta_2}{2}\,(x - x')^2\right)$$


Here, the covariance function is a squared exponential, for which values of $x$ and $x'$ that are close together result in values of $k$ closer to its maximum $\theta_1$, while those that are far apart return values closer to zero.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


$$k(x, x') = \theta_1 \exp\left(-\frac{\theta_2}{2}\,(x - x')^2\right)$$

In [7]:
def exponential_cov(x, y, params):
    return params[0] * np.exp( - params[1]/2 * np.subtract.outer(x, y)**2)


$$p(x \mid y) = \mathcal{N}\left(\mu_x + \Sigma_{xy}\Sigma_y^{-1}(y - \mu_y),\; \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T\right)$$

In [8]:
def conditional(x_new, x, y, params):

    # covariance blocks: B = K(x_new, x), C = K(x, x), A = K(x_new, x_new)
    B = exponential_cov(x_new, x, params)
    C = exponential_cov(x, x, params)
    A = exponential_cov(x_new, x_new, params)

    # posterior mean and covariance of the GP at x_new, given observations (x, y)
    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))

    return (mu.squeeze(), sigma.squeeze())
In [9]:
theta = [1, 5]
sigma = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma, capsize=0)
Out[9]:
<Container object of 3 artists>
In [10]:
x = [1.]
y = [np.random.normal(scale=sigma)]
y
Out[10]:
[-0.2553116444435854]
In [11]:
sigma_1 = exponential_cov(x, x, theta)

def predict(x, data, kernel, params, sigma, t):
    # covariances between the new input x and each observed input in data
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    # posterior mean and variance at x, conditioning on the observed targets t
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
In [12]:
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[12]:
[<matplotlib.lines.Line2D at 0x1e8d3ebf518>]
In [13]:
m, s = conditional([-0.7], x, y, theta)
y2 = np.random.normal(m, s)
y2
Out[13]:
0.24147030277027695
In [14]:
x.append(-0.7)
y.append(y2)

sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]
In [15]:
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[15]:
[<matplotlib.lines.Line2D at 0x1e8d1dfe438>]
In [16]:
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)
y_more
Out[16]:
array([ 1.39778923, -0.32097195,  0.61480776, -1.1724211 , -0.95255744])
In [17]:
x += x_more
y += y_more.tolist()

sigma_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[17]:
[<matplotlib.lines.Line2D at 0x1e8d3f8b4e0>]

There are a number of libraries available for specifying and fitting GP models in a more automated way.

  • scikit-learn (we will explore this)
  • GPflow
  • PyMC3
In [18]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
In [19]:
x = np.linspace(0,10,100)
r = np.sin(x)

plt.plot(x,r)
plt.show()
In [20]:
y = r + 1*np.random.rand(100)
plt.plot(x,y,'.')
Out[20]:
[<matplotlib.lines.Line2D at 0x1e8d5d49588>]

scikit-learn offers a library of kernels to choose from. A flexible choice to start with is the Matérn covariance.

In [21]:
kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)
In [22]:
X = x.reshape(-1,1)
In [23]:
gp = GaussianProcessRegressor(kernel = kernel)
In [24]:
gp.fit(X,y)
Out[24]:
GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
             kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
             n_restarts_optimizer=0, normalize_y=False,
             optimizer='fmin_l_bfgs_b', random_state=None)
In [25]:
gp.kernel_
Out[25]:
0.00316**2 + Matern(length_scale=2.67, nu=1.5) + WhiteKernel(noise_level=0.0689)
In [26]:
x_pred = np.linspace(0, 10, 100).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)
In [27]:
plt.plot(x_pred, y_pred)
Out[27]:
[<matplotlib.lines.Line2D at 0x1e8d5d8e940>]
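
Since `gp.predict` was called with `return_std=True`, one can also overlay an approximate 95% band (mean $\pm$ 2 standard deviations) around the prediction; this small addition reuses `x_pred`, `y_pred`, `sigma`, `X`, and `y` from the cells above.

In [ ]:
# predictive mean with a ~95% confidence band
plt.plot(x_pred, y_pred, label='mean')
plt.fill_between(x_pred.ravel(), y_pred - 2*sigma, y_pred + 2*sigma,
                 alpha=0.3, label='+/- 2 std')
plt.plot(X, y, '.', markersize=3, label='data')
plt.legend()
plt.show()
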
In [28]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')