Gaussian Process



By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. Gaussian Random Vectors


Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, where $\Sigma = \Sigma^T$ and $\Sigma \succ 0$.



$$p(x) = \frac{1}{(2\pi)^{n/2}\,(\det \Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$


1.1. The marginal pdf of a Gaussian (is Gaussian)


Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, and


$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$


Let's look at the component $x_1 = \begin{bmatrix} I & 0 \end{bmatrix} x$

  • mean of $x_1$


$$\mathbb{E}\, x_1 = \begin{bmatrix} I & 0 \end{bmatrix} \mu = \mu_1$$


  • covariance of $x_1$


$$\operatorname{cov}(x_1) = \begin{bmatrix} I & 0 \end{bmatrix} \Sigma \begin{bmatrix} I \\ 0 \end{bmatrix} = \Sigma_{11}$$

  • In fact, the random vector $x_1$ is Gaussian (this is not obvious).


1.2. Linear transformation of Gaussian (is Gaussian)


Suppose $x \sim \mathcal{N}(\mu_x, \Sigma_x)$. Consider the linear function of $x$


$$y = Ax + b$$


  • We already know how means and covariances transform. We have


$$\mathbb{E}\, y = A\, \mathbb{E}\, x + b, \qquad \operatorname{cov}(y) = A \operatorname{cov}(x)\, A^T$$


$$\mu_y = A\mu_x + b, \qquad \Sigma_y = A \Sigma_x A^T$$


  • The amazing fact is that y is Gaussian.



1.3. Components of a Gaussian random vector (is Gaussian)


Suppose $x \sim \mathcal{N}(0, \Sigma)$, and let $c \in \mathbb{R}^n$ be a unit vector.

Let $y = c^T x$

  • $y$ is the component of $x$ in the direction $c$
  • $y$ is Gaussian, with $\mathbb{E}\, y = 0$ and $\operatorname{cov}(y) = c^T \Sigma c$
  • So $\mathbb{E}(y^2) = c^T \Sigma c$
  • (PCA) The unit vector $c$ that maximizes $c^T \Sigma c$ is the eigenvector of $\Sigma$ with the largest eigenvalue. Then


$$\mathbb{E}(y^2) = \lambda_{\max}$$
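
As a quick numerical check of this fact (a sketch that is not part of the original derivation; the covariance matrix below is arbitrary), we can sample $x \sim \mathcal{N}(0, \Sigma)$, project onto the top eigenvector of $\Sigma$, and compare the empirical variance with $\lambda_{\max}$:

In [ ]:
import numpy as np

np.random.seed(0)

# an arbitrary symmetric positive definite covariance matrix
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# eigenvector of Sigma with the largest eigenvalue (eigh sorts eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(Sigma)
lam_max = eigvals[-1]
c = eigvecs[:, -1]                          # unit vector maximizing c^T Sigma c

# sample x ~ N(0, Sigma) and take the component y = c^T x
X = np.random.multivariate_normal(np.zeros(2), Sigma, size=100000)
y = X @ c

print(np.var(y), c @ Sigma @ c, lam_max)    # all approximately equal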


1.4. Conditional pdf for a Gaussian (is Gaussian)


Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, and


$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$


Suppose we measure $x_2 = y$. We would like to find the conditional pdf of $x_1$ given $x_2 = y$.

  • Is it Gaussian?
  • What is the conditional mean $\mathbb{E}(x_1 \mid x_2 = y)$?
  • What is the conditional covariance $\operatorname{cov}(x_1 \mid x_2 = y)$?

By the completion of squares formula


$$\Sigma^{-1} = \begin{bmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix} \begin{bmatrix} \left(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{bmatrix} \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix}$$



If $x \sim \mathcal{N}(0, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian.

  • The conditional mean is


$$\mathbb{E}(x_1 \mid x_2 = y) = \Sigma_{12}\Sigma_{22}^{-1}\, y$$


It is a linear function of y.

  • The conditional covariance is


$$\operatorname{cov}(x_1 \mid x_2 = y) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \leq \Sigma_{11}$$


It is not a function of y. Instead, it is constant.

Conditional confidence intervals are narrower, i.e., measuring $x_2$ gives information about $x_1$.




If $x \sim \mathcal{N}(\mu, \Sigma)$, then the conditional pdf of $x_1$ given $x_2 = y$ is Gaussian:


$$p(x_1 \mid x_2 = y) = \mathcal{N}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$$
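
To make the conditioning formula concrete, here is a small numerical sketch (the mean and covariance below are arbitrary): compare the closed-form conditional mean and variance with a Monte Carlo estimate obtained by keeping only samples whose $x_2$ is close to the measured value.

In [ ]:
import numpy as np

np.random.seed(1)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
y = 0.5                                        # measured value of x2

# conditional mean and variance from the formula
m_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y - mu[1])
v_cond = Sigma[0, 0] - Sigma[0, 1]**2 / Sigma[1, 1]

# Monte Carlo check: keep samples whose x2 falls in a narrow window around y
X = np.random.multivariate_normal(mu, Sigma, size=2000000)
sel = np.abs(X[:, 1] - y) < 0.01
print(m_cond, X[sel, 0].mean())                # conditional mean
print(v_cond, X[sel, 0].var())                 # conditional variance (smaller than Sigma[0, 0])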

1.5. Summary

  • Closure under multiplication
    • the product of Gaussian factors is (proportional to) a Gaussian



  • Closure under linear maps
    • linear maps of Gaussians are Gaussians



  • Closure under marginalization
    • projections of Gaussians are Gaussian



  • Closure under conditioning
    • cuts through Gaussians are Gaussians


2. Linear Model


$$y = Ax + \omega$$


  • $x$ and $\omega$ are independent
  • We have induced pdfs $p_x$ and $p_\omega$

Estimation problem: we measure $y = y_{\text{meas}}$ and would like to estimate $x$.


$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\ A & I \end{bmatrix} \begin{bmatrix} x \\ \omega \end{bmatrix}$$


  • We will measure $y = y_{\text{meas}}$ and estimate $x$
  • To do this, we would like the conditional pdf of $x \mid y = y_{\text{meas}}$
  • For this, we need the joint pdf of $x$ and $y$

2.1. Linear measurements with Gaussian noise


$$y = Ax + \omega$$


  • Suppose $x \sim \mathcal{N}(0, \Sigma_x)$ and $\omega \sim \mathcal{N}(0, \Sigma_\omega)$
  • So $\begin{bmatrix} x \\ y \end{bmatrix}$ is Gaussian with mean and covariance


$$\mathbb{E}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \operatorname{cov}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \Sigma_x & \Sigma_x A^T \\ A\Sigma_x & A\Sigma_x A^T + \Sigma_\omega \end{bmatrix}$$


The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is


$$\hat{x}_{\text{mmse}} = \mathbb{E}(x \mid y = y_{\text{meas}}) = \Sigma_{12}\Sigma_{22}^{-1}\, y_{\text{meas}} = \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} y_{\text{meas}}$$


The posterior covariance of $x$ given $y = y_{\text{meas}}$ is


$$\operatorname{cov}(x \mid y = y_{\text{meas}}) = \Sigma_x - \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} A\Sigma_x \leq \Sigma_x$$


Example: linear measurements with Gaussian noise
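
The original slides illustrate this with a worked example; as a stand-in, here is a minimal numerical sketch in which $A$, $\Sigma_x$, and $\Sigma_\omega$ are chosen arbitrarily:

In [ ]:
import numpy as np

np.random.seed(2)

n, m = 4, 6                                   # x in R^4, y in R^6
A = np.random.randn(m, n)
Sigma_x = np.eye(n)                           # prior covariance of x
Sigma_w = 0.1 * np.eye(m)                     # noise covariance

# simulate one measurement y = A x + w
x_true = np.random.multivariate_normal(np.zeros(n), Sigma_x)
w = np.random.multivariate_normal(np.zeros(m), Sigma_w)
y_meas = A @ x_true + w

# MMSE estimate and posterior covariance
G = Sigma_x @ A.T @ np.linalg.inv(A @ Sigma_x @ A.T + Sigma_w)
x_mmse = G @ y_meas
Sigma_post = Sigma_x - G @ A @ Sigma_x

print(x_true)
print(x_mmse)                                 # close to x_true
print(np.diag(Sigma_post))                    # posterior variances are below the prior variances (= 1)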


2.1.1. Signal-to-noise ratio



2.1.2. Matrix equality


$$\operatorname{cov}(x \mid y = y_{\text{meas}}) = \Sigma_x - \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} A\Sigma_x = \left(\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A\right)^{-1}$$


$$L = \Sigma_{12}\Sigma_{22}^{-1} = \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1} = \left(\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A\right)^{-1} A^T \Sigma_\omega^{-1}$$
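
Both identities are easy to verify numerically on random matrices (a sketch using the same $\Sigma_x$, $A$, $\Sigma_\omega$ notation; the particular values are arbitrary):

In [ ]:
import numpy as np

np.random.seed(3)

n, m = 3, 5
A = np.random.randn(m, n)
Sigma_x = np.diag([1.0, 2.0, 0.5])            # any positive definite prior covariance
Sigma_w = 0.2 * np.eye(m)                     # noise covariance

inv = np.linalg.inv

# covariance form vs. information form of the posterior covariance
cov1 = Sigma_x - Sigma_x @ A.T @ inv(A @ Sigma_x @ A.T + Sigma_w) @ A @ Sigma_x
cov2 = inv(inv(Sigma_x) + A.T @ inv(Sigma_w) @ A)

# the two expressions for the estimator gain L
L1 = Sigma_x @ A.T @ inv(A @ Sigma_x @ A.T + Sigma_w)
L2 = inv(inv(Sigma_x) + A.T @ inv(Sigma_w) @ A) @ A.T @ inv(Sigma_w)

print(np.allclose(cov1, cov2), np.allclose(L1, L2))   # True True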


2.1.3. Non-zero means

Suppose $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $\omega \sim \mathcal{N}(\mu_\omega, \Sigma_\omega)$.

The MMSE estimate of $x$ given $y = y_{\text{meas}}$ is


$$\hat{x}_{\text{mmse}} = \mu_x + \Sigma_x A^T\left(A\Sigma_x A^T + \Sigma_\omega\right)^{-1}\left(y_{\text{meas}} - A\mu_x - \mu_\omega\right)$$

2.2. MMSE and Least-Squares

For Gaussians, the MMSE estimate is equal to the MAP estimate.

The least-squares approach minimizes


$$\lVert y - Ax \rVert^2 = \sum_{i=1}^{m}\left(y_i - a_i^T x\right)^2$$

where $A = \begin{bmatrix} a_1 & a_2 & \cdots & a_m \end{bmatrix}^T$


2.3. Weighted norms

Suppose instead we minimize


$$\sum_{i=1}^{m} \omega_i \left(y_i - a_i^T x\right)^2$$

where $\omega_1, \omega_2, \ldots, \omega_m$ provide the weights.



2.3.1. Weighted Least-Squares


$$\underset{x}{\text{minimize}} \quad \lVert y_{\text{meas}} - Ax \rVert_W$$


Then the optimum $x_{\text{wls}}$ is


$$x_{\text{wls}} = \left(A^T W A\right)^{-1} A^T W\, y_{\text{meas}}$$


  • $Ax_{\text{wls}}$ is the closest point (in the weighted norm) in $\operatorname{range}(A)$ to $y_{\text{meas}}$



2.3.2. MMSE and Weighted Least-Squares

Suppose we choose the weight $W = \Sigma_\omega^{-1}$.

The WLS solution is


$$x_{\text{wls}} = \left(A^T \Sigma_\omega^{-1} A\right)^{-1} A^T \Sigma_\omega^{-1}\, y_{\text{meas}}$$


The MMSE estimate is


$$x_{\text{mmse}} = \left(\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A\right)^{-1} A^T \Sigma_\omega^{-1}\, y_{\text{meas}}$$


  • As the prior covariance $\Sigma_x \to \infty$, the MMSE estimate tends to the WLS estimate (see the numerical sketch below)
  • If $\Sigma_\omega = I$, then the MMSE estimate tends to the usual least-squares solution as $\Sigma_x \to \infty$
  • The weighted norm heavily penalizes the residual $y - Ax$ in low-noise directions
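
A short numerical sketch of the first point (the data and noise levels below are arbitrary): with $W = \Sigma_\omega^{-1}$, the MMSE estimate approaches the WLS estimate as the prior covariance $\Sigma_x$ grows.

In [ ]:
import numpy as np

np.random.seed(4)

n, m = 3, 8
A = np.random.randn(m, n)
Sigma_w = np.diag(np.random.uniform(0.1, 1.0, m))     # heteroscedastic noise
W = np.linalg.inv(Sigma_w)

x_true = np.array([1.0, -2.0, 0.5])
y_meas = A @ x_true + np.random.multivariate_normal(np.zeros(m), Sigma_w)

inv = np.linalg.inv
x_wls = inv(A.T @ W @ A) @ A.T @ W @ y_meas

for scale in [0.1, 1.0, 10.0, 1000.0]:                # prior covariance Sigma_x = scale * I
    Sigma_x_inv = np.eye(n) / scale
    x_mmse = inv(Sigma_x_inv + A.T @ W @ A) @ A.T @ W @ y_meas
    print(scale, np.linalg.norm(x_mmse - x_wls))      # the gap shrinks as scale grows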

3. Gaussian Process Regression

3.1. Bayesian linear regression

  • Why not use MLE? It overfits.

  • Why not use MAP? It gives no representation of uncertainty.

  • Why Bayesian? We obtain the full predictive distribution, which lets us quantify uncertainty and minimize the expected loss.

It gives us $p(y \mid x, \mathcal{D})$. This is what we really want.

Motivation

In [1]:
%%html
<center><iframe src="https://www.youtube.com/embed/dtkGq9tdYcI"
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Weight-space view



The function-space view will be discussed in the following section.

3.2. Gaussian process regression


$$\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad x_i \in \mathbb{R}^d, \;\; y_i \in \mathbb{R}$$


$$p(y_i \mid x_i, \omega) = \mathcal{N}\left(y_i \mid \omega^T x_i, \sigma^2\right), \qquad \omega \sim \mathcal{N}(0, \gamma I)$$


$$y_i = \omega^T x_i + \varepsilon_i$$


For any $x \in \mathbb{R}^d$, $\omega^T x$ is univariate Gaussian.


Then $Z_x = x^T \omega$ is a GP on $S = \mathbb{R}^d$ with


$$\mu(x) = \mathbb{E}\, Z_x = 0$$


$$k(x, x') = \operatorname{cov}(Z_x, Z_{x'}) = \mathbb{E}\, Z_x Z_{x'} - \mathbb{E}\, Z_x\, \mathbb{E}\, Z_{x'} = \mathbb{E}\, x^T \omega\, \omega^T x' = x^T\, \mathbb{E}\left(\omega \omega^T\right) x' = \gamma\, x^T x'$$
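
A quick sampling check of this covariance (a sketch; $\gamma$ and the two inputs below are arbitrary): draw many $\omega \sim \mathcal{N}(0, \gamma I)$ and compare the empirical covariance of $Z_x$ and $Z_{x'}$ with $\gamma\, x^T x'$.

In [ ]:
import numpy as np

np.random.seed(5)

d, gamma = 3, 2.0
x  = np.array([1.0, 0.5, -1.0])
xp = np.array([0.2, -0.3, 0.7])

# draw many weight vectors w ~ N(0, gamma I)
W = np.random.multivariate_normal(np.zeros(d), gamma * np.eye(d), size=200000)

Zx, Zxp = W @ x, W @ xp                       # Z_x = x^T w for each sampled w
print(np.cov(Zx, Zxp)[0, 1])                  # empirical covariance
print(gamma * x @ xp)                         # k(x, x') = gamma * x^T x'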


With non-linear basis

  • $\tilde{x}_i = \varphi(x_i)$

Generally


$$y_i = Z_{x_i} + \varepsilon_i$$
In [2]:
%%html
<center><iframe src="https://www.youtube.com/embed/upJ31CIVWZo" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

The key step of GP regression


$$y_i = Z_{x_i} + \varepsilon_i$$


Let $z \sim \mathcal{N}(\mu, K) \in \mathbb{R}^n$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I) \in \mathbb{R}^n$ be independent.


$$y = z + \varepsilon \sim \mathcal{N}\left(\mu,\; \underbrace{K + \sigma^2 I}_{C}\right)$$


Partition the indices as $a = [1, \ldots, l]^T$ and $b = [l+1, \ldots, n]^T$, so that


$$y = \begin{bmatrix} y_a \\ y_b \end{bmatrix}$$


$$\mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \qquad C = \begin{bmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb} \end{bmatrix}, \qquad K = \begin{bmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{bmatrix}$$


Conditional distribution: $p(y_a \mid y_b) = \mathcal{N}(m, D)$,

where


$$m = \mu_a + C_{ab} C_{bb}^{-1}\left(y_b - \mu_b\right), \qquad D = C_{aa} - C_{ab} C_{bb}^{-1} C_{ba}$$


$$m = \mu_a + K_{ab}\left(K_{bb} + \sigma^2 I\right)^{-1}\left(y_b - \mu_b\right), \qquad D = \left(K_{aa} + \sigma^2 I\right) - K_{ab}\left(K_{bb} + \sigma^2 I\right)^{-1} K_{ba}$$
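
These two formulas are all that GP regression needs: with $b$ indexing the (noisy) training observations and $a$ the test points, $m$ is the posterior mean and $D$ the posterior covariance. A minimal sketch implementing them directly, with $\mu = 0$, a squared exponential kernel, and illustrative parameter values (the helper `sqexp_kernel` is introduced here only for this example):

In [ ]:
import numpy as np

def sqexp_kernel(a, b, theta1=1.0, theta2=2.0):
    # k(x, x') = theta1 * exp(-theta2/2 * (x - x')^2)
    return theta1 * np.exp(-theta2 / 2 * np.subtract.outer(a, b)**2)

np.random.seed(6)

# noisy training data  y_i = Z_{x_i} + eps_i
sigma2 = 0.01
xb = np.linspace(0, 5, 10)
yb = np.sin(xb) + np.sqrt(sigma2) * np.random.randn(len(xb))

# test inputs (the "a" block); training inputs are the "b" block
xa = np.linspace(0, 5, 200)

Kab = sqexp_kernel(xa, xb)
Kbb = sqexp_kernel(xb, xb)
Kaa = sqexp_kernel(xa, xa)

Cbb_inv = np.linalg.inv(Kbb + sigma2 * np.eye(len(xb)))
m = Kab @ Cbb_inv @ yb                                        # posterior mean (mu = 0)
D = (Kaa + sigma2 * np.eye(len(xa))) - Kab @ Cbb_inv @ Kab.T  # posterior (predictive) covariance of y_a

print(m[:5])
print(np.sqrt(np.diag(D))[:5])                                # pointwise predictive standard deviation
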
In [3]:
%%html
<center><iframe src="https://www.youtube.com/embed/UH1d2mxwet8" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [4]:
%%html
<center><iframe src="https://www.youtube.com/embed/JdZr74mtZkU" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

Function-space view

  • Squared exponential (~RBF) kernel



Gaussian Processes: From the Basics to the State-of-the-Art, Imperial's ML tutorial series

In [5]:
%%html
<center><iframe src="https://www.youtube.com/embed/92-98SYOdlY" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

4. GP Regression with Python

We can describe a Gaussian process as a distribution over functions. Just as a multivariate normal distribution is completely specified by a mean vector and covariance matrix, a GP is fully specified by a mean function and a covariance function:


$$p(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right)$$


For example, one specification of a GP might be:


$$m(x) = 0, \qquad k(x, x') = \theta_1 \exp\left(-\frac{\theta_2}{2}\,(x - x')^2\right)$$


Here, the covariance function is a squared exponential, for which values of $x$ and $x'$ that are close together result in values of $k$ closer to its maximum $\theta_1$, while those that are far apart return values closer to zero.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


$$k(x, x') = \theta_1 \exp\left(-\frac{\theta_2}{2}\,(x - x')^2\right)$$

In [7]:
def exponential_cov(x, y, params):
    return params[0] * np.exp( - params[1]/2 * np.subtract.outer(x, y)**2)


$$p(x \mid y) = \mathcal{N}\left(\mu_x + \Sigma_{xy}\Sigma_y^{-1}(y - \mu_y),\; \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T\right)$$

In [8]:
def conditional(x_new, x, y, params):

    # covariance blocks: B = K(x_new, x), C = K(x, x), A = K(x_new, x_new)
    B = exponential_cov(x_new, x, params)
    C = exponential_cov(x, x, params)
    A = exponential_cov(x_new, x_new, params)

    # posterior mean and covariance of the GP at x_new, given observations (x, y)
    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))

    return (mu.squeeze(), sigma.squeeze())
In [9]:
theta = [1, 5]
sigma = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma, capsize=0)
Out[9]:
<Container object of 3 artists>
In [10]:
x = [1.]
y = [np.random.normal(scale=sigma)]
y
Out[10]:
[-0.2553116444435854]
In [11]:
sigma_1 = exponential_cov(x, x, theta)

def predict(x, data, kernel, params, sigma, t):
    # covariances between the new input x and each observed input in data
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    # posterior mean and variance at x, conditioning on the observed targets t
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
In [12]:
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[12]:
[<matplotlib.lines.Line2D at 0x1e8d3ebf518>]
In [13]:
m, s = conditional([-0.7], x, y, theta)
y2 = np.random.normal(m, s)
y2
Out[13]:
0.24147030277027695
In [14]:
x.append(-0.7)
y.append(y2)

sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]
In [15]:
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[15]:
[<matplotlib.lines.Line2D at 0x1e8d1dfe438>]
In [16]:
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)
y_more
Out[16]:
array([ 1.39778923, -0.32097195,  0.61480776, -1.1724211 , -0.95255744])
In [17]:
x += x_more
y += y_more.tolist()

sigma_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]

y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
Out[17]:
[<matplotlib.lines.Line2D at 0x1e8d3f8b4e0>]

There are a number of libraries available for specifying and fitting GP models in a more automated way.

  • scikit-learn (we will explore this)
  • GPflow
  • PyMC3
In [18]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
In [19]:
x = np.linspace(0,10,100)
r = np.sin(x)

plt.plot(x,r)
plt.show()
In [20]:
y = r + 1*np.random.rand(100)
plt.plot(x,y,'.')
Out[20]:
[<matplotlib.lines.Line2D at 0x1e8d5d49588>]

scikit-learn offers a library of kernels to choose from. A flexible choice to start with is the Matérn covariance.

In [21]:
kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)
In [22]:
X = x.reshape(-1,1)
In [23]:
gp = GaussianProcessRegressor(kernel = kernel)
In [24]:
gp.fit(X,y)
Out[24]:
GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
             kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
             n_restarts_optimizer=0, normalize_y=False,
             optimizer='fmin_l_bfgs_b', random_state=None)
In [25]:
gp.kernel_
Out[25]:
0.00316**2 + Matern(length_scale=2.67, nu=1.5) + WhiteKernel(noise_level=0.0689)
In [26]:
x_pred = np.linspace(0, 10, 100).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)
In [27]:
plt.plot(x_pred, y_pred)
Out[27]:
[<matplotlib.lines.Line2D at 0x1e8d5d8e940>]
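
Since `gp.predict` was called with `return_std=True`, one can also overlay an approximate 95% band (mean $\pm$ 2 standard deviations) around the prediction; this small addition reuses `x_pred`, `y_pred`, `sigma`, `X`, and `y` from the cells above.

In [ ]:
# predictive mean with a ~95% confidence band
plt.plot(x_pred, y_pred, label='mean')
plt.fill_between(x_pred.ravel(), y_pred - 2*sigma, y_pred + 2*sigma,
                 alpha=0.3, label='+/- 2 std')
plt.plot(X, y, '.', markersize=3, label='data')
plt.legend()
plt.show()
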
In [28]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')