Gaussian Process



By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH


1. Gaussian Random Vectors


Suppose $x \sim N(\mu, \Sigma)$, where $\Sigma = \Sigma^T$ and $\Sigma > 0$.



$$ p(x) = \frac{1}{(2\pi)^{\frac{n}{2}}(\text{det}\,\Sigma)^{\frac{1}{2}}} \exp \left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$


1.1. The marginal pdf of a Gaussian (is Gaussian)


Suppose $x \sim N(\mu, \Sigma)$, and


$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} $$


Let's look at the component $x_1 = \begin{bmatrix} I & 0 \end{bmatrix} x$

  • mean of $x_1$


$$E x_1 = \begin{bmatrix} I & 0 \end{bmatrix} \mu= \mu_1$$


  • covariance of $x_1$


$$\text{cov}(x_1) = \begin{bmatrix} I & 0 \end{bmatrix} \Sigma \begin{bmatrix} I \\ 0 \end{bmatrix} = \Sigma_{11}$$
  • In fact, the random variable $x_1$ is Gaussian (this is not obvious); see the sketch below for a numerical check.
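
A minimal numpy sketch (not from the original notes; the numbers are illustrative) that samples a jointly Gaussian vector and checks that the first block behaves like $N(\mu_1, \Sigma_{11})$:

import numpy as np

np.random.seed(0)

# illustrative mu = [mu_1; mu_2] and Sigma, with x_1 = the first two components
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

X = np.random.multivariate_normal(mu, Sigma, size=100000)
X1 = X[:, :2]                       # x_1 = [I 0] x

print(X1.mean(axis=0))              # approx mu_1 = [1, 2]
print(np.cov(X1.T))                 # approx Sigma_11 = [[2.0, 0.5], [0.5, 1.0]]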


1.2. Linear transformation of Gaussian (is Gaussian)


Suppose $x \sim N(\mu_x, \Sigma_x)$. Consider the linear function of $x$


$$y = Ax + b$$


  • We already know how means and covariances transform. We have


$$\mathbf{E}y = A \, \mathbf{E}x + b \qquad \mathbf{cov}(y) = A \, \mathbf{cov}(x) \, A^{T}$$


$$\mu_y = A \mu_x + b \quad \qquad \Sigma_y = A \Sigma_x A^{T}$$


  • The amazing fact is that $y$ is Gaussian (a quick numerical check follows below).
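
As a sanity check, here is a short sketch (illustrative $A$, $b$, $\mu_x$, $\Sigma_x$; not from the original notes) that verifies the transformed mean and covariance empirically:

import numpy as np

np.random.seed(0)

mu_x = np.array([0.0, 1.0])
Sigma_x = np.array([[1.0, 0.4],
                    [0.4, 2.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])
b = np.array([1.0, -1.0, 0.5])

X = np.random.multivariate_normal(mu_x, Sigma_x, size=100000)
Y = X @ A.T + b                      # y = A x + b, applied row-wise

print(Y.mean(axis=0))                # approx A mu_x + b
print(A @ mu_x + b)
print(np.cov(Y.T))                   # approx A Sigma_x A^T
print(A @ Sigma_x @ A.T)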



1.3. Components of a Gaussian random vector (is Gaussian)


Suppose $x \sim N(0, \Sigma)$, and let $c \in \mathbb{R}^n$ be a unit vector

Let $y = c^T x$

  • $y$ is the component of $x$ in the direction $c$
  • $y$ is Gaussian, with $\mathbf{E}\, y = 0$ and $\mathbf{cov}(y)= c^T \Sigma c$
  • So $\mathbf{E}\left(y^2\right) = c^T \Sigma c$
  • (PCA) The unit vector $c$ that maximizes $c^T \Sigma c$ is the eigenvector of $\Sigma$ with the largest eigenvalue. Then


$$\mathbf{E}\left(y^2\right) = \lambda_{\text{max}}$$
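
A small sketch (with an illustrative $\Sigma$, not from the source) confirming that the top eigenvector attains $\lambda_{\text{max}}$:

import numpy as np

Sigma = np.array([[3.0, 1.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
c = eigvecs[:, -1]                          # unit eigenvector for lambda_max

print(c @ Sigma @ c)                        # equals lambda_max
print(eigvals[-1])

d = np.array([1.0, 0.0, 0.0])               # any other unit vector gives a smaller value
print(d @ Sigma @ d)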


1.4. Conditional pdf for a Gaussian (is Gaussian)


Suppose $x \sim N(\mu, \Sigma)$, and


$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} $$


Suppose we measure $x_2 = y$. We would like to find the conditional pdf of $x_1$ given $x_2 = y$

  • Is it Gaussian ?
  • What is the conditional mean $\mathbf{E}(x_1 \mid x_2= y)$ ?
  • What is the conditional covariance $\mathbf{cov}(x_1 \mid x_2= y)$ ?

By the completion of squares formula


$$\Sigma^{-1} = \begin{bmatrix} I & 0 \\ -\Sigma^{-1}_{22}\Sigma_{21} & I \end{bmatrix} \begin{bmatrix} \left( \Sigma_{11} - \Sigma_{12}\Sigma^{-1}_{22}\Sigma_{21} \right)^{-1} & 0 \\ 0 & \Sigma^{-1}_{22} \end{bmatrix} \begin{bmatrix} I & -\Sigma_{12}\Sigma^{-1}_{22} \\ 0 & I \end{bmatrix}$$



If $x \sim N(0, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian

  • The conditional mean is


$$\mathbf{E}(x_1 \mid x_2= y) = \Sigma_{12} \Sigma^{-1}_{22}\,y$$


$\quad \;$ It is a linear function of $y$.

  • The conditional covariance is


$$\mathbf{cov}(x_1 \mid x_2= y) = \Sigma_{11} - \Sigma_{12}\Sigma^{-1}_{22}\Sigma_{21} < \Sigma_{11}$$


$\quad \;$ It is not a function of $y$. Instead, it is constant.

$\quad \;$ Conditional confidence intervals are narrower, i.e., measuring $x_2$ gives information about $x_1$.




If $x \sim N(\mu, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian


$$p(x_1 \mid x_2 = y) = \mathcal{N}\left( \mu_1 + \Sigma_{12} \Sigma^{-1}_{22} \,(y-\mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma^{-1}_{22}\Sigma_{21} \right)$$
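
A direct implementation of this conditioning formula might look as follows; the helper name gaussian_conditional and the example numbers are illustrative, not part of the original notes:

import numpy as np

def gaussian_conditional(mu1, mu2, S11, S12, S22, y):
    # parameters of x_1 | x_2 = y for x ~ N(mu, Sigma) with the block partition above
    L = S12 @ np.linalg.inv(S22)            # gain Sigma_12 Sigma_22^{-1}
    cond_mean = mu1 + L @ (y - mu2)
    cond_cov = S11 - L @ S12.T              # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    return cond_mean, cond_cov

# illustrative partition: x_1 is the first component, x_2 the remaining two
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, D = gaussian_conditional(mu[:1], mu[1:], Sigma[:1, :1], Sigma[:1, 1:], Sigma[1:, 1:],
                            np.array([0.5, -0.2]))
print(m)    # conditional mean, a linear function of y
print(D)    # conditional covariance, independent of y and smaller than Sigma_11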

1.5 Summary

  • Closure under multiplication
    • products of Gaussian factors are (up to normalization) Gaussian



  • Closure under linear maps
    • linear maps of Gaussians are Gaussians



  • Closure under marginalization
    • projections of Gaussians are Gaussian



  • Closure under conditioning
    • cuts through Gaussians are Gaussians


2. Linear Model


$$y = Ax + \omega$$


  • $x$ and $\omega$ are independent
  • We have induced pdfs $p_x$ and $p_{\omega}$

Estimation problem: we measure $y = y_{\text{meas}}$ and would like to estimate $x$.


$$\begin{bmatrix} x \\ y\end{bmatrix} = \begin{bmatrix} I & 0 \\ A &I\end{bmatrix} \begin{bmatrix} x \\ \omega\end{bmatrix}$$


  • We will measure $y = y_{\text{meas}}$ and estimate $x$
  • To do this, we would like the conditional pdf of $x \mid y = y_{\text{meas}}$
  • For this, we need the joint pdf of $x$ and $y$

2.1. Linear measurements with Gaussian noise


$$y = Ax + \omega$$


  • Suppose $x \sim N(0, \Sigma_x)$ and $\omega \sim N(0, \Sigma_{\omega})$
  • So $\begin{bmatrix} x \\ y\end{bmatrix}$ is Gaussian with mean and covariance


$$ \mathbf{E}\begin{bmatrix} x \\ y\end{bmatrix} = \begin{bmatrix} 0 \\ 0\end{bmatrix}, \quad \mathbf{cov} \begin{bmatrix} x \\ y\end{bmatrix} = \begin{bmatrix} \Sigma_x & \Sigma_x A^T\\ A \Sigma_x & A\Sigma_x A^T + \Sigma_{\omega}\end{bmatrix}$$


The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is


$$\hat{x}_{\text{mmse}} = \mathbf{E}(x \mid y= y_{\text{meas}}) = \Sigma_{12} \Sigma^{-1}_{22}\,y_{\text{meas}} = \Sigma_x A^T \left( A\Sigma_x A^T + \Sigma_{\omega}\right)^{-1} \, y_{\text{meas}}$$


The posterior covariance of $x$ given $y = y_{\text{meas}}$ is


$$ \mathbf{cov}(x \mid y= y_{\text{meas}}) = \Sigma_x - \Sigma_x A^T \left( A\Sigma_x A^T + \Sigma_{\omega}\right)^{-1} A \Sigma_x < \Sigma_x$$
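
A sketch of the resulting estimator; the function name mmse_estimate and the example matrices are illustrative, not from the source:

import numpy as np

def mmse_estimate(A, Sigma_x, Sigma_w, y_meas):
    # zero-mean case: x ~ N(0, Sigma_x), w ~ N(0, Sigma_w), y = A x + w
    S = A @ Sigma_x @ A.T + Sigma_w                 # cov(y)
    K = Sigma_x @ A.T @ np.linalg.inv(S)            # gain Sigma_x A^T (A Sigma_x A^T + Sigma_w)^{-1}
    x_hat = K @ y_meas                              # MMSE estimate
    post_cov = Sigma_x - K @ A @ Sigma_x            # posterior covariance
    return x_hat, post_cov

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
Sigma_x = np.eye(2)
Sigma_w = 0.1 * np.eye(3)
y_meas = np.array([0.9, 2.1, 3.8])

x_hat, post_cov = mmse_estimate(A, Sigma_x, Sigma_w, y_meas)
print(x_hat)
print(post_cov)     # "smaller" than Sigma_x: measuring y reduces uncertainty about x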


Example: linear measurements with Gaussian noise


2.1.1. Signal-to-noise ratio



2.1.2. Matrix equality


$$ \mathbf{cov}(x \mid y= y_{\text{meas}}) = \Sigma_x - \Sigma_x A^T \left( A\Sigma_x A^T + \Sigma_{\omega}\right)^{-1} A \Sigma_x = \left( \Sigma_{x}^{-1} + A^T\Sigma_{\omega}^{-1} A\right)^{-1}$$


$$L=\Sigma_{12}\Sigma_{22}^{-1} = \Sigma_x A^T \left( A\Sigma_x A^T + \Sigma_{\omega} \right)^{-1} = \left( \Sigma_{x}^{-1} + A^T \Sigma_{\omega}^{-1} A \right)^{-1} A^T \Sigma_{\omega}^{-1}$$
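
Both identities can be checked numerically; the matrices below are arbitrary illustrative choices:

import numpy as np

np.random.seed(1)
A = np.random.randn(4, 3)
Sigma_x = np.eye(3)
Sigma_w = 0.5 * np.eye(4)

lhs = Sigma_x - Sigma_x @ A.T @ np.linalg.inv(A @ Sigma_x @ A.T + Sigma_w) @ A @ Sigma_x
rhs = np.linalg.inv(np.linalg.inv(Sigma_x) + A.T @ np.linalg.inv(Sigma_w) @ A)
print(np.allclose(lhs, rhs))        # posterior covariance: both forms agree

L1 = Sigma_x @ A.T @ np.linalg.inv(A @ Sigma_x @ A.T + Sigma_w)
L2 = rhs @ A.T @ np.linalg.inv(Sigma_w)
print(np.allclose(L1, L2))          # gain L: both forms agree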


2.1.3. Non-zero means

Suppose $x \sim N(\mu_x, \Sigma_x)$ and $\omega \sim N(\mu_{\omega}, \Sigma_{\omega})$

The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is


$$\hat{x}_{\text{mmse}} =\mu_x + \Sigma_x A^T \left( A\Sigma_x A^T + \Sigma_{\omega}\right)^{-1} \, \left(y_{\text{meas}} - A\mu_x - \mu_{\omega} \right)$$

2.2. MMSE and Least-Squares

For Gaussians, the MMSE estimate is equal to the MAP estimate.

The least-squares approach minimizes


$$\lVert y - Ax\rVert^2 = \sum_{i = 1}^{m}\left( y_i - a_i^T x\right)^2$$

where $A = [a_1 \; a_2\; \cdots \; a_m]^T$


2.3. Weighted norms

Suppose instead we minimize


$$\sum_{i = 1}^{m} \omega_i \left( y_i - a_i^T x\right)^2$$

where $\omega_1,\; \omega_2,\; \cdots,\; \omega_m$ provide weights



2.3.1. Weighted Least-Squares


$$\text{minimize} \; \lVert y_{\text{meas}} - Ax\rVert_W^2 = \left( y_{\text{meas}} - Ax\right)^T W \left( y_{\text{meas}} - Ax\right), \qquad W = \text{diag}(\omega_1, \cdots, \omega_m)$$


Then the optimum $x_{\text{wls}}$ is


$$x_{\text{wls}} = (A^TWA)^{-1}A^TW\, y_{\text{meas}}$$


  • $Ax_{\text{wls}}$ is the closest point (in the weighted norm) in the range of $A$ to $y_{\text{meas}}$; a numerical sketch follows below
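
A minimal weighted least-squares sketch (illustrative data and weights, not from the source) showing that heavily weighted measurements end up with the smallest residuals:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y_meas = np.array([1.0, 2.2, 2.9])
W = np.diag([1.0, 1.0, 10.0])        # trust the third measurement most

x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_meas)   # (A^T W A)^{-1} A^T W y
print(x_wls)
print(y_meas - A @ x_wls)            # the residual is smallest where the weight is largest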



2.3.2. MMSE and Weighted Least-Squares

Suppose we choose weight $W = \Sigma_{\omega}^{-1}$, then

WLS solution is


$$x_{\text{wls}} = (A^T\Sigma_{\omega}^{-1}A)^{-1}A^T\Sigma_{\omega}^{-1}\, y_{\text{meas}}$$


MMSE estimate is


$$x_{\text{mmse}} = \left( \Sigma_{x}^{-1} + A^T \Sigma_{\omega}^{-1} A \right)^{-1} A^T \Sigma_{\omega}^{-1} y_{\text{meas}}$$


  • As the prior covariance $\Sigma_x \rightarrow \infty$, the MMSE estimate tends to the WLS estimate (see the sketch below)
  • If $\Sigma_{\omega} = I$, then the MMSE estimate tends to the usual least-squares solution as $\Sigma_x \rightarrow \infty$
  • The weighted norm heavily penalizes the residual $y-Ax$ in low-noise directions
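
The first bullet can be checked numerically; the problem data below is randomly generated for illustration only:

import numpy as np

np.random.seed(2)
A = np.random.randn(5, 2)
Sigma_w = np.diag([0.1, 0.1, 1.0, 1.0, 1.0])
y_meas = np.random.randn(5)

W = np.linalg.inv(Sigma_w)
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_meas)

for scale in [1.0, 100.0, 1e6]:
    Sigma_x = scale * np.eye(2)
    x_mmse = np.linalg.solve(np.linalg.inv(Sigma_x) + A.T @ W @ A, A.T @ W @ y_meas)
    print(scale, np.linalg.norm(x_mmse - x_wls))    # gap shrinks as the prior widens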

3. Gaussian Process Regression

3.1. Bayesian linear regression

  • Why not use MLE? It tends to overfit.

  • Why not use MAP? It gives a point estimate with no representation of uncertainty.

  • Why Bayesian? With the full posterior we can optimize any loss function.

$\quad \;$ It gives us $p(y \mid x, D)$, which is what we really want.

Motivation

In [1]:
%%html
<center><iframe src="https://www.youtube.com/embed/dtkGq9tdYcI"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Weight-space view



The function-space view will be discussed in the following section.

3.2. Gaussian process regression


$$D = \left\{ (x_1, y_1),\cdots,(x_n, y_n)\right\}, \; x_i \in \mathbb{R}^d, \,y_i \in \mathbb{R}$$


$$ p(y_i \mid x_i, \omega) = N(y_i \mid \omega^T x_i, \sigma^2) \qquad \omega \sim N(0,\gamma I)$$


$$ y_i = \omega^T x_i + \varepsilon_i$$


$$\implies \forall x \in \mathbb{R}^d, \; \omega^Tx \;\text{ is univariate Gaussian} $$


Then $Z_x = x^T \omega$ is a GP on $S = \mathbb{R}^d$


$$\mu (x) = \mathbf{E}\,Z_x = 0$$


$$k(x,x') = \mathbf{cov}\,(Z_x, Z_{x'})= \mathbf{E}\, Z_x Z_{x'} - \mathbf{E}\, Z_x \mathbf{E}\,Z_{x'} = \mathbf{E}\,x^T \omega \omega^T x' = x^T \mathbf{E}\,\left(\omega \omega^T \right) x' = \gamma x^Tx'$$
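
To connect the weight-space and kernel views, here is a sketch (illustrative $\gamma$ and inputs, not from the source) that estimates $\mathbf{cov}(Z_x, Z_{x'})$ by sampling $\omega$ and compares it with $\gamma x^T x'$:

import numpy as np

np.random.seed(3)
gamma, d, n_samples = 2.0, 3, 200000

x  = np.array([1.0, -0.5, 2.0])
xp = np.array([0.3,  1.0, -1.0])

w_samples = np.sqrt(gamma) * np.random.randn(n_samples, d)   # samples of w ~ N(0, gamma I)
Zx, Zxp = w_samples @ x, w_samples @ xp                      # samples of Z_x and Z_{x'}

print(np.mean(Zx * Zxp))     # empirical covariance (both are zero-mean)
print(gamma * x @ xp)        # gamma x^T x'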


With a non-linear basis

  • $\tilde x_i = \varphi (x_i)$

Generally


$$ y_i = Z_{x_i} + \varepsilon_i$$
In [2]:
%%html
<center><iframe src="https://www.youtube.com/embed/upJ31CIVWZo" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

The key step of GP regression


$$ y_i = Z_{x_i} + \varepsilon_i$$


Let $z \sim N(\mu, K) \in \mathbb{R}^n$ and $\varepsilon \sim N(0,\sigma^2I) \in \mathbb{R}^n$ be independent. Then


$$y = z + \varepsilon \sim N(\mu, \underbrace{K + \sigma^2 I}_{C})$$


Partition the indices into $a = [1,\cdots, l]^T$ and $b = [l+1,\cdots, n]^T$, so that


$$y = \begin{bmatrix} y_a \\ y_b\end{bmatrix}$$


$$\mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix} \quad C = \begin{bmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb}\end{bmatrix} \\ K = \begin{bmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{bmatrix} $$


The conditional distribution is $y_a \mid y_b \sim N(m,D)$

where


$$ \begin{align*} m &= \mu_a + C_{ab}C_{bb}^{-1}(y_b - \mu_b) \\ D &= C_{aa} - C_{ab}C_{bb}^{-1}C_{ba} \end{align*} $$


$$ \begin{align*} m &= \mu_a + K_{ab}(K_{bb}+\sigma^2I)^{-1}(y_b - \mu_b) \\ D &= (K_{aa} + \sigma^2 I) - K_{ab}(K_{bb}+\sigma^2I)^{-1}K_{ba} \end{align*} $$
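
These formulas translate almost line by line into code. The sketch below uses a squared-exponential kernel with illustrative hyperparameters and assumes a zero prior mean; the helper names rbf and gp_posterior are not from the original notes:

import numpy as np

def rbf(X1, X2, ell=1.0, var=1.0):
    # squared-exponential kernel on 1-D inputs
    return var * np.exp(-0.5 * np.subtract.outer(X1, X2)**2 / ell**2)

def gp_posterior(X_train, y_train, X_test, noise=0.1, ell=1.0, var=1.0):
    # m and D exactly as in the formulas above, with a zero prior mean
    K_bb = rbf(X_train, X_train, ell, var)
    K_ab = rbf(X_test, X_train, ell, var)
    K_aa = rbf(X_test, X_test, ell, var)
    L = K_ab @ np.linalg.inv(K_bb + noise**2 * np.eye(len(X_train)))
    m = L @ y_train
    D = K_aa + noise**2 * np.eye(len(X_test)) - L @ K_ab.T
    return m, D

X_train = np.array([-2.0, -1.0, 0.5, 2.0])
y_train = np.sin(X_train)
X_test = np.linspace(-3, 3, 7)

m, D = gp_posterior(X_train, y_train, X_test)
print(m)                     # posterior mean at the test inputs
print(np.sqrt(np.diag(D)))   # posterior standard deviation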
In [3]:
%%html
<center><iframe src="https://www.youtube.com/embed/UH1d2mxwet8" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [4]:
%%html
<center><iframe src="https://www.youtube.com/embed/JdZr74mtZkU" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Function-space view

  • Squared exponential (~RBF) kernel



Gaussian Processes: From the Basics to the State-of-the-Art, Imperial's ML tutorial series

In [5]:
%%html
<center><iframe src="https://www.youtube.com/embed/92-98SYOdlY" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

4. GP Regression with Python

We can describe a Gaussian process as a distribution over functions. Just as a multivariate normal distribution is completely specified by a mean vector and a covariance matrix, a GP is fully specified by a mean function and a covariance function:


$$f(x) \sim \mathcal{GP} \left(m(x),k \left(x,x' \right) \right)$$


For example, one specification of a GP might be:


$$ \begin{align*} m(x) &= 0\\ k(x,x') &= \theta_1 \exp\left( -\frac{\theta_2}{2} \left(x-x' \right)^2 \right) \end{align*} $$


Here, the covariance function is a squared exponential, for which values of $x$ and $x^{\prime}$ that are close together result in values of $k$ close to $\theta_1$, while those that are far apart return values close to zero.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


$$k(x,x') = \theta_1 \exp\left( -\frac{\theta_2}{2} \left(x-x' \right)^2 \right)$$

In [7]:
def exponential_cov(x, y, params):
    # squared exponential kernel: k(x, x') = theta_1 * exp(-theta_2 / 2 * (x - x')^2)
    return params[0] * np.exp(-params[1] / 2 * np.subtract.outer(x, y)**2)


$$p(x \mid y) = \mathcal{N}\left( \mu_x + \Sigma_{xy} \Sigma^{-1}_{y} \,(y-\mu_y),\; \Sigma_{x} - \Sigma_{xy}\Sigma^{-1}_{y}\Sigma_{xy}^T \right)$$

In [8]:
def conditional(x_new, x, y, params):
    # GP posterior at x_new given noise-free observations (x, y)
    B = exponential_cov(x_new, x, params)       # cross-covariance  k(x_new, x)
    C = exponential_cov(x, x, params)           # training covariance k(x, x)
    A = exponential_cov(x_new, x_new, params)   # prior covariance at x_new

    mu = np.linalg.inv(C).dot(B.T).T.dot(y)         # posterior mean  B C^{-1} y
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))    # posterior covariance  A - B C^{-1} B^T

    return (mu.squeeze(), sigma.squeeze())
In [9]:
theta = [1, 5]
sigma = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma, capsize=0)
Out[9]:
<Container object of 3 artists>
In [10]:
x = [1.]
y = [np.random.normal(scale=sigma)]
y
Out[10]:
[-0.2553116444435854]
In [11]:
sigma_1 = exponential_cov(x, x, theta)

def predict(x, data, kernel, params, sigma, t):
    # GP posterior at a single test point x, given training inputs `data` and targets `t`
    k = [kernel(x, y, params) for y in data]                    # cross-covariances k(x, x_i)
    Sinv = np.linalg.inv(sigma)                                 # inverse of the training covariance
    y_pred = np.dot(k, Sinv).dot(t)                             # posterior mean
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)   # posterior variance
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
In [12]:
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[12]:
[<matplotlib.lines.Line2D at 0x1e8d3ebf518>]
In [13]:
m, s = conditional([-0.7], x, y, theta)
y2 = np.random.normal(m, s)
y2
Out[13]:
0.24147030277027695
In [14]:
x.append(-0.7)
y.append(y2)

sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]
In [15]:
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[15]:
[<matplotlib.lines.Line2D at 0x1e8d1dfe438>]
In [16]:
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)
y_more
Out[16]:
array([ 1.39778923, -0.32097195,  0.61480776, -1.1724211 , -0.95255744])
In [17]:
x += x_more
y += y_more.tolist()

sigma_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[17]:
[<matplotlib.lines.Line2D at 0x1e8d3f8b4e0>]

There are a number of libraries available for specifying and fitting GP models in a more automated way.

  • scikit-learn (we will explore this)
  • GPflow
  • PyMC3
In [18]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
In [19]:
x = np.linspace(0,10,100)
r = np.sin(x)

plt.plot(x,r)
plt.show()
In [20]:
y = r + 1*np.random.rand(100)
plt.plot(x,y,'.')
Out[20]:
[<matplotlib.lines.Line2D at 0x1e8d5d49588>]

scikit-learn offers a library of kernels to choose from. A flexible choice to start with is the Matern covariance.

In [21]:
kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)
In [22]:
X = x.reshape(-1,1)
In [23]:
gp = GaussianProcessRegressor(kernel = kernel)
In [24]:
gp.fit(X,y)
Out[24]:
GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
             kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
             n_restarts_optimizer=0, normalize_y=False,
             optimizer='fmin_l_bfgs_b', random_state=None)
In [25]:
gp.kernel_
Out[25]:
0.00316**2 + Matern(length_scale=2.67, nu=1.5) + WhiteKernel(noise_level=0.0689)
In [26]:
x_pred = np.linspace(0, 10, 100).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)
In [27]:
plt.plot(x_pred, y_pred)
Out[27]:
[<matplotlib.lines.Line2D at 0x1e8d5d8e940>]