Gaussian Process
Source
Table of Contents
Suppose $x \sim N(\mu, \Sigma)$, where $\Sigma = \Sigma^T$ and $\Sigma > 0$.
Suppose $x \sim N(\mu, \Sigma)$ is partitioned as
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$
Let's look at the component $x_1 = \begin{bmatrix} I & 0 \end{bmatrix} x$. Its marginal distribution is $x_1 \sim N(\mu_1, \Sigma_{11})$.
Suppose $x \sim N(\mu_x, \Sigma_x)$. Consider the linear function of $x$, $y = Ax + b$; then $y \sim N\left(A\mu_x + b,\; A\Sigma_x A^T\right)$.
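A quick numerical check of this fact (all values below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(4)

# Illustrative parameters for x ~ N(mu_x, Sigma_x) and y = A x + b
mu_x = np.array([1.0, -2.0])
Sigma_x = np.array([[1.5, 0.4],
                    [0.4, 0.8]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, -1.0]])
b = np.array([0.5, 0.0, 1.0])

# Sample x, form y = A x + b, and compare sample moments with the formulas above
x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = x @ A.T + b
print(y.mean(axis=0), A @ mu_x + b)   # sample mean vs. A mu_x + b
print(np.cov(y.T))                    # close to A Sigma_x A^T
print(A @ Sigma_x @ A.T)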
Suppose $x \sim N(0, \Sigma)$, and let $c \in \mathbb{R}^n$ be a unit vector
Let $y = c^T x$
$$\mathbf{E}\left(y^2\right) = c^T \Sigma\, c \le \lambda_{\text{max}}(\Sigma),$$
with equality when $c$ is a unit eigenvector of $\Sigma$ associated with $\lambda_{\text{max}}$.
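A small numerical illustration of this bound (the covariance below is an arbitrary illustrative choice):

import numpy as np

rng = np.random.default_rng(0)

# Arbitrary symmetric positive definite covariance for illustration
M = rng.standard_normal((4, 4))
Sigma = M @ M.T + 4 * np.eye(4)

# Leading eigenpair of Sigma
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
lam_max, c_max = eigvals[-1], eigvecs[:, -1]

# E(y^2) = c^T Sigma c never exceeds lambda_max over unit vectors c
for _ in range(5):
    c = rng.standard_normal(4)
    c /= np.linalg.norm(c)
    assert c @ Sigma @ c <= lam_max + 1e-9

# Equality holds when c is the leading unit eigenvector
print(c_max @ Sigma @ c_max, lam_max)      # equal up to round-off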
Suppose $x \sim N(\mu, \Sigma)$ is partitioned as above, with $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$, and $\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$.
Suppose we measure $x_2 = y$. We would like to find the conditional pdf of $x_1$ given $x_2 = y$.
By the completion-of-squares formula, the quadratic form in the exponent splits as
$$(x-\mu)^T \Sigma^{-1} (x-\mu) = \left(x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\right)^T S^{-1} \left(x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\right) + (x_2 - \mu_2)^T \Sigma_{22}^{-1} (x_2 - \mu_2),$$
where $S = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ is the Schur complement of $\Sigma_{22}$.
If $x \sim N(0, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian:
$$x_1 \mid x_2 = y \;\sim\; N\left(\Sigma_{12}\Sigma_{22}^{-1} y,\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$$
$\quad \;$ The conditional mean is a linear function of $y$.
$\quad \;$ The conditional covariance is not a function of $y$; it is constant.
$\quad \;$ Conditional confidence intervals are narrower, i.e., measuring $x_2$ gives information about $x_1$.
If $x \sim N(\mu, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian:
$$x_1 \mid x_2 = y \;\sim\; N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$$
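The sketch below evaluates these conditional formulas on a small illustrative example (the numbers are arbitrary):

import numpy as np

# Partitioned Gaussian: x1 = x[0], x2 = x[1:3]; values are illustrative
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
i1, i2 = [0], [1, 2]

S11 = Sigma[np.ix_(i1, i1)]
S12 = Sigma[np.ix_(i1, i2)]
S22 = Sigma[np.ix_(i2, i2)]

y = np.array([0.4, -0.2])     # observed value of x2

# Conditional mean is linear in y; conditional covariance does not depend on y
mu_cond = mu[i1] + S12 @ np.linalg.solve(S22, y - mu[i2])
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
print(mu_cond, Sigma_cond)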
Estimation problem: we measure $y = y_{\text{meas}}$ and would like to estimate $x$.
The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is the conditional mean $\hat{x}_{\text{mmse}} = \mathbf{E}\left(x \mid y = y_{\text{meas}}\right)$.
The posterior covariance of $x$ given $y = y_{\text{meas}}$ is the conditional covariance $\mathbf{cov}\left(x \mid y = y_{\text{meas}}\right)$, which quantifies the uncertainty remaining after the measurement.
Example: linear measurements with Gaussian noise
Suppose $x \sim N(\mu_x, \Sigma_x)$ and $\omega \sim N(\mu_{\omega}, \Sigma_{\omega})$ are independent, and we measure $y = Ax + \omega$.
The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is
$$\hat{x}_{\text{mmse}} = \mu_x + \Sigma_x A^T \left(A \Sigma_x A^T + \Sigma_{\omega}\right)^{-1} \left(y_{\text{meas}} - A\mu_x - \mu_{\omega}\right)$$
For Gaussians, the MMSE estimate is equal to the MAP estimate.
The least-squares approach minimizes
$$\left\| Ax - y \right\|^2 = \sum_{i=1}^{m} \left(a_i^T x - y_i\right)^2$$
where $A = [a_1 \; a_2\; \cdots \; a_m]^T$
Suppose instead we minimize the weighted criterion
$$\sum_{i=1}^{m} \omega_i \left(a_i^T x - y_i\right)^2 = \left(Ax - y\right)^T W \left(Ax - y\right),$$
where $\omega_1,\; \omega_2,\; \cdots,\; \omega_m$ provide weights and $W = \mathbf{diag}\left(\omega_1, \cdots, \omega_m\right)$.
Then the optimum $x_{\text{wls}}$ is
$$x_{\text{wls}} = \left(A^T W A\right)^{-1} A^T W y$$
Suppose we choose the weight $W = \Sigma_{\omega}^{-1}$. Then the WLS solution is
$$x_{\text{wls}} = \left(A^T \Sigma_{\omega}^{-1} A\right)^{-1} A^T \Sigma_{\omega}^{-1} y_{\text{meas}}$$
The MMSE estimate is
$$\hat{x}_{\text{mmse}} = \left(\Sigma_x^{-1} + A^T \Sigma_{\omega}^{-1} A\right)^{-1} \left(A^T \Sigma_{\omega}^{-1} \left(y_{\text{meas}} - \mu_{\omega}\right) + \Sigma_x^{-1} \mu_x\right),$$
which approaches the WLS solution as the prior becomes uninformative ($\Sigma_x^{-1} \to 0$) and $\mu_{\omega} = 0$.
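A minimal sketch comparing the two estimates, assuming zero-mean measurement noise, a broad zero-mean prior, and a randomly generated $A$ (all illustrative choices):

import numpy as np

rng = np.random.default_rng(1)

n, m = 3, 20
A = rng.standard_normal((m, n))
Sigma_x = 10.0 * np.eye(n)                     # broad prior covariance on x
Sigma_w = np.diag(rng.uniform(0.1, 1.0, m))    # heteroscedastic noise covariance

x_true = rng.standard_normal(n)
y_meas = A @ x_true + rng.multivariate_normal(np.zeros(m), Sigma_w)

W = np.linalg.inv(Sigma_w)                     # weights W = Sigma_w^{-1}

# Weighted least squares: minimize (Ax - y)^T W (Ax - y)
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_meas)

# MMSE / posterior mean with a zero-mean Gaussian prior x ~ N(0, Sigma_x)
x_mmse = np.linalg.solve(np.linalg.inv(Sigma_x) + A.T @ W @ A, A.T @ W @ y_meas)

# The two nearly coincide here; they agree exactly as the prior becomes flat
print(x_wls)
print(x_mmse)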
%%html
<center><iframe src="https://www.youtube.com/embed/dtkGq9tdYcI"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
Weight-space view
The function-space view will be discussed in the following section.
Let the weights have a Gaussian prior $\omega \sim N(0, \Sigma_p)$. Then $Z_x = x^T \omega$ is a GP on $S = \mathbb{R}^d$ with mean zero and covariance $k(x, x') = x^T \Sigma_p\, x'$.
With a non-linear basis $\phi(\cdot)$, take $Z_x = \phi(x)^T \omega$, which gives covariance $k(x, x') = \phi(x)^T \Sigma_p\, \phi(x')$.
Generally, any mean function $m(x)$ together with a positive semidefinite covariance function $k(x, x')$ specifies a GP, whether or not it arises from an explicit basis.
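A sketch of the weight-space view, assuming a simple polynomial feature map and an identity prior covariance on the weights (both illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Weight-space view: with a Gaussian prior on the weights, f(x) = phi(x)^T w
# is a GP with covariance k(x, x') = phi(x)^T Sigma_p phi(x').
def phi(x, degree=5):
    return np.vstack([x ** k for k in range(degree + 1)]).T   # feature map

xs = np.linspace(-1, 1, 200)
Phi = phi(xs)                       # (200, 6) feature matrix
Sigma_p = np.eye(Phi.shape[1])      # prior covariance of the weights

# Induced GP covariance matrix over the grid
K = Phi @ Sigma_p @ Phi.T

# Sampling weights and forming phi(x)^T w ...
for _ in range(3):
    w = rng.multivariate_normal(np.zeros(Phi.shape[1]), Sigma_p)
    plt.plot(xs, Phi @ w, lw=1)

# ... is equivalent to sampling directly from N(0, K) (jitter for stability)
plt.plot(xs, rng.multivariate_normal(np.zeros(len(xs)), K + 1e-8 * np.eye(len(xs))),
         "--", lw=1)
plt.title("Weight-space samples of a GP")
plt.show()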
%%html
<center><iframe src="https://www.youtube.com/embed/upJ31CIVWZo"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
The key step of GP regression is conditioning a joint Gaussian on the observed outputs.
Let $z \sim N(\mu, K) \in \mathbb{R}^n$ and $\varepsilon \sim N(0,\sigma^2 I) \in \mathbb{R}^n$ be independent, and define the noisy observations $y = z + \varepsilon \sim N(\mu, K + \sigma^2 I)$.
Partition the indices into $a = [1,\cdots, l]^T$ and $b = [l+1,\cdots, n]^T$.
The conditional distribution is $p(y_a \mid y_b) \sim N(m, D)$, where, writing $\tilde{K} = K + \sigma^2 I$,
$$m = \mu_a + \tilde{K}_{ab} \tilde{K}_{bb}^{-1} \left(y_b - \mu_b\right), \qquad D = \tilde{K}_{aa} - \tilde{K}_{ab} \tilde{K}_{bb}^{-1} \tilde{K}_{ba}$$
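A sketch of this conditioning step, assuming a squared-exponential covariance for $K$ and arbitrary parameter values:

import numpy as np

rng = np.random.default_rng(3)

def sq_exp(s, t):
    # Illustrative squared-exponential covariance
    return np.exp(-0.5 * np.subtract.outer(s, t) ** 2)

s = np.linspace(0, 5, 8)            # n = 8 input locations; a = first 3, b = rest
l = 3
mu = np.zeros(len(s))
sigma2 = 0.1
K = sq_exp(s, s)
Kt = K + sigma2 * np.eye(len(s))    # covariance of y = z + eps

a, b = np.arange(l), np.arange(l, len(s))
y_b = rng.multivariate_normal(mu[b], Kt[np.ix_(b, b)])   # stand-in observations

# m = mu_a + Kt_ab Kt_bb^{-1} (y_b - mu_b),  D = Kt_aa - Kt_ab Kt_bb^{-1} Kt_ba
Kab, Kbb, Kaa = Kt[np.ix_(a, b)], Kt[np.ix_(b, b)], Kt[np.ix_(a, a)]
m = mu[a] + Kab @ np.linalg.solve(Kbb, y_b - mu[b])
D = Kaa - Kab @ np.linalg.solve(Kbb, Kab.T)
print(m)
print(np.diag(D))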
%%html
<center><iframe src="https://www.youtube.com/embed/UH1d2mxwet8"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
%%html
<center><iframe src="https://www.youtube.com/embed/JdZr74mtZkU"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
Function-space view
Gaussian Processes: From the Basics to the State-of-the-Art, Imperial's ML tutorial series
%%html
<center><iframe src="https://www.youtube.com/embed/92-98SYOdlY"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
We can describe a Gaussian process as a distribution over functions. Just as a multivariate normal distribution is completely specified by a mean vector and covariance matrix, a GP is fully specified by a mean function and a covariance function:
$$f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right)$$
For example, one specification of a GP might be:
$$m(x) = 0, \qquad k(x, x') = \exp\left(-\frac{1}{2}\left(x - x'\right)^2\right)$$
Here, the covariance function is a squared exponential, for which values of $x$ and $x^{\prime}$ that are close together result in values of k closer to one, while those that are far apart return values closer to zero.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
$$k(x,x') = \theta_1 \exp\left( -\frac{\theta_2}{2} \left(x-x' \right)^2 \right)$$
def exponential_cov(x, y, params):
    # Squared-exponential kernel: k(x, y) = params[0] * exp(-params[1]/2 * (x - y)^2)
    return params[0] * np.exp(-params[1] / 2 * np.subtract.outer(x, y) ** 2)
$$p(x \mid y) = \mathcal{N}\left( \mu_x + \Sigma_{xy} \Sigma^{-1}_{y} \,(y-\mu_y),\; \Sigma_{x} - \Sigma_{xy}\Sigma^{-1}_{y}\Sigma_{xy}^T \right)$$
def conditional(x_new, x, y, params):
    # Posterior mean and covariance of the process at x_new given observations (x, y)
    B = exponential_cov(x_new, x, params)      # K(x_new, x)
    C = exponential_cov(x, x, params)          # K(x, x)
    A = exponential_cov(x_new, x_new, params)  # K(x_new, x_new)
    mu = np.linalg.inv(C).dot(B.T).T.dot(y)    # B C^{-1} y (zero prior mean)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))
    return (mu.squeeze(), sigma.squeeze())
theta = [1, 5]
sigma = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma, capsize=0)
x = [1.]
y = [np.random.normal(scale=sigma)]
y
sigma_1 = exponential_cov(x, x, theta)
def predict(x, data, kernel, params, sigma, t):
    # Predictive mean and variance at a test point x, given training inputs
    # `data`, training targets `t`, and their covariance matrix `sigma`
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new
x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
m, s = conditional([-0.7], x, y, theta)
y2 = np.random.normal(m, s)
y2
x.append(-0.7)
y.append(y2)
sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)
y_more
x += x_more
y += y_more.tolist()
sigma_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
There are a number of libraries available for specifying and fitting GP models in a more automated way:
- scikit-learn (we will explore this)
- GPflow
- PyMC3
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
x = np.linspace(0,10,100)
r = np.sin(x)
plt.plot(x,r)
plt.show()
y = r + 1*np.random.rand(100)
plt.plot(x,y,'.')
scikit-learn offers a library of kernels to choose from. A flexible choice to start with is the Matérn covariance.
kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)
X = x.reshape(-1,1)
gp = GaussianProcessRegressor(kernel = kernel)
gp.fit(X,y)
gp.kernel_
x_pred = np.linspace(0, 10, 100).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)
plt.plot(x_pred, y_pred)
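As a follow-up, the standard deviation returned by `gp.predict` can be used to shade an approximate 95% band around the posterior mean (a plotting sketch):

# Shade roughly a 95% credible band using the returned standard deviation
plt.plot(x_pred, y_pred, label="posterior mean")
plt.fill_between(x_pred.ravel(),
                 y_pred - 2 * sigma,
                 y_pred + 2 * sigma,
                 alpha=0.3, label="±2 std. dev.")
plt.plot(X, y, ".", label="observations")
plt.legend()
plt.show()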
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')