Gaussian Process
Source
S. Lall at Stanford (http://lall.stanford.edu/)
Jeffrey Miller at Harvard (http://jwmi.github.io/)
Andrew Ng at Stanford
Philipp Hennig at the University of Tübingen
Richard Turner at the University of Cambridge (https://rich-turner-group.github.io/)
Example and code
Table of Contents
1. Gaussian Random Vectors¶
Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, where $\Sigma = \Sigma^T$ and $\Sigma > 0$. The probability density function is:
$$ p(x) = \frac{1}{(2\pi)^{\frac{n}{2}}(\det \Sigma)^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right) $$
1.1. Marginal Distribution of a Gaussian Vector¶
Let:
$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} $$
Then:
$$ x_1 = \begin{bmatrix} 1 & 0 \end{bmatrix} x $$
Mean:
$$ \mathbb{E}[x_1] = \begin{bmatrix} 1 & 0 \end{bmatrix} \mu = \mu_1 $$
Covariance:
$$ \operatorname{cov}(x_1) = \begin{bmatrix} 1 & 0 \end{bmatrix} \Sigma \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \Sigma_{11} $$
Thus, $x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$. The fact that the marginal is itself Gaussian may not be immediately obvious, but it follows from the closure of Gaussians under linear transformations, discussed next.
1.2. Linear Transformations of Gaussian Vectors¶
If $x \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $y = Ax + b$, then:
$$ \mathbb{E}[y] = A \mu_x + b, \quad \operatorname{cov}(y) = A \Sigma_x A^T $$
Thus:
$$ y \sim \mathcal{N}(A\mu_x + b, A\Sigma_x A^T) $$
The key fact is that $y$ is itself Gaussian: an affine transformation of a Gaussian vector preserves Gaussianity, not just the mean and covariance.
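As a quick numerical sanity check, we can sample from an arbitrarily chosen Gaussian, apply an affine map, and compare the empirical moments of $y$ with $A\mu_x + b$ and $A\Sigma_x A^T$ (a minimal sketch; the matrices below are made up for illustration):

import numpy as np

np.random.seed(0)
mu_x = np.array([1.0, -1.0])
Sigma_x = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
b = np.array([0.5, -0.5, 1.0])

# sample x ~ N(mu_x, Sigma_x) and apply the affine map
X = np.random.multivariate_normal(mu_x, Sigma_x, size=100000)
Y = X @ A.T + b

print(Y.mean(axis=0))           # close to A mu_x + b
print(A @ mu_x + b)
print(np.cov(Y, rowvar=False))  # close to A Sigma_x A^T
print(A @ Sigma_x @ A.T)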
1.3. Components of a Gaussian Vector¶
Consider a random vector $x \sim \mathcal{N}(0, \Sigma)$ and a vector $c \in \mathbb{R}^n$. Define the scalar projection of $x$ onto $c$:
$$ y = c^T x $$
This represents the component of $x$ in the direction $c$.
Since $x$ is Gaussian and the projection is linear, $y$ is also Gaussian.
The mean of $y$ is:
$$ \mathbb{E}[y] = c^T \mathbb{E}[x] = 0 $$
The variance of $y$ is:
$$ \operatorname{var}(y) = c^T \operatorname{cov}(x) c = c^T \Sigma c $$
Thus:
$$ \mathbb{E}[y^2] = c^T \Sigma c $$
In Principal Component Analysis (PCA), the direction $c$ that maximizes $\mathbb{E}[y^2]$ is the eigenvector of $\Sigma$ corresponding to the largest eigenvalue $\lambda_{\max}$. Therefore:
$$ \max_c \mathbb{E}[y^2] = \lambda_{\max} $$
This shows that the largest variance occurs along the principal direction of the Gaussian distribution.
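A small sketch of this fact (the covariance below is arbitrary; the maximization is over unit-norm directions $c$):

import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# eigenvector of Sigma with the largest eigenvalue (eigh returns ascending order)
eigvals, eigvecs = np.linalg.eigh(Sigma)
c = eigvecs[:, -1]              # unit-norm principal direction
lam_max = eigvals[-1]

print(c @ Sigma @ c)            # variance of y = c^T x along c
print(lam_max)                  # equals lambda_max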
1.4. Conditional Distribution of a Gaussian Vector¶
Suppose $x \sim \mathcal{N}(\mu, \Sigma)$ with
$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}. $$
Suppose we observe $x_2 = y$. We want to determine:
- Is $x_1 \mid x_2=y$ Gaussian?
- What is its conditional mean $\mathbb{E}[x_1 \mid x_2=y]$?
- What is its conditional covariance $\operatorname{cov}(x_1 \mid x_2=y)$?
Our goal is to determine the conditional distribution $p(x_1 \mid x_2 = y)$.
Step 1: Joint PDF of $x$
The joint pdf is
$$ p(x) \propto \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right). $$
Partition the terms:
$$ x - \mu = \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}, \quad \Sigma^{-1} = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}. $$
Expand the quadratic form:
$$ Q = (x_1-\mu_1)^T A (x_1-\mu_1) + 2(x_1-\mu_1)^T B (x_2-\mu_2) + (x_2-\mu_2)^T C (x_2-\mu_2). $$
Step 2: Completion of Squares
$$ Q = \left(x_1 - \mu_1 + A^{-1} B (x_2 - \mu_2)\right)^T A \left(x_1 - \mu_1 + A^{-1} B (x_2 - \mu_2)\right) + (x_2 - \mu_2)^T (C - B^T A^{-1} B) (x_2 - \mu_2). $$
Thus, the joint pdf becomes
$$ p(x_1, x_2) \propto \exp\left(-\frac{1}{2}(x_2 - \mu_2)^T (C - B^T A^{-1} B)(x_2 - \mu_2)\right) \exp\left(-\frac{1}{2}(x_1 - \mu_1 + A^{-1} B (x_2 - \mu_2))^T A (x_1 - \mu_1 + A^{-1} B (x_2 - \mu_2))\right). $$
Step 3: Conditional PDF
For fixed $x_2=y$:
$$ p(x_1 \mid x_2=y) \propto \exp\left(-\frac{1}{2}(x_1 - \mu_1 + A^{-1}B(y - \mu_2))^T A (x_1 - \mu_1 + A^{-1}B(y - \mu_2))\right). $$
This is a Gaussian distribution:
$$ x_1 \mid x_2=y \; \sim \; \mathcal{N}(\mu_{1|2}, \Sigma_{1|2}), $$
where
$$ \mu_{1|2} = \mu_1 - A^{-1} B (y - \mu_2), \quad \Sigma_{1|2} = A^{-1}. $$
Step 4: Express in Terms of $\Sigma$
It is known that
$$ \Sigma^{-1} = \begin{bmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix} \begin{bmatrix} (\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{bmatrix} \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix}. $$
This gives
$$ A = (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}, \quad B = -A \Sigma_{12} \Sigma_{22}^{-1}. $$
Thus,
$$ \begin{align*} \mu_{1|2} &= \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y-\mu_2) \\\\ \Sigma_{1|2} &= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. \end{align*} $$
Step 5: Conclusion
$$ p(x_1 \mid x_2=y) = \mathcal{N}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2), \;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right). $$
- The conditional distribution is Gaussian.
- The conditional mean shifts linearly with $y$.
- The conditional covariance is reduced and independent of $y$, reflecting the information gained by observing $x_2$.
Let's consider a simpler case where $x \sim \mathcal{N}(0, \Sigma)$. In this case, the conditional distribution of $x_1$ given $x_2 = y$ is also Gaussian.
The conditional mean is:
$$ \mathbb{E}[x_1 \mid x_2 = y] = \Sigma_{12} \Sigma_{22}^{-1} y $$
This is a linear function of $y$.
The conditional covariance is:
$$ \operatorname{cov}(x_1 \mid x_2 = y) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} < \Sigma_{11} $$
This matrix does not depend on $y$ and is strictly smaller than the marginal covariance $\Sigma_{11}$. The reduction in covariance reflects the information gained about $x_1$ by observing $x_2$.
In summary:
- Conditioning on $x_2$ narrows the confidence intervals for $x_1$.
- The conditional distribution has reduced uncertainty and a mean that depends linearly on the observed value of $x_2$.
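A minimal numerical sketch of these conditioning formulas, using an arbitrary $2 \times 2$ covariance so that each block is a scalar:

import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])

# partition: x1 is the first component, x2 the second
S11, S12 = Sigma[0, 0], Sigma[0, 1]
S21, S22 = Sigma[1, 0], Sigma[1, 1]

y = 1.0  # observed value of x2

mu_cond = mu[0] + S12 / S22 * (y - mu[1])   # conditional mean, linear in y
S_cond = S11 - S12 / S22 * S21              # conditional variance, independent of y

print(mu_cond)
print(S_cond, "<", S11)                     # reduced uncertainty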
1.5. Summary¶
Gaussian distributions have several important closure properties that make them fundamental in probability theory and statistics:
(1) Closure under multiplication:
The product of multiple Gaussian densities is proportional to another Gaussian density. This property ensures that combining Gaussian priors and likelihoods in Bayesian inference yields a Gaussian posterior.
(2) Closure under linear transformations:
If $x$ is a Gaussian random vector, then any linear transformation of $x$, such as $y = A x + b$, results in a Gaussian distribution for $y$. This property greatly simplifies the analysis of linear systems with Gaussian inputs.
(3) Closure under marginalization:
Projections (or marginal distributions) of a joint Gaussian remain Gaussian. In other words, the distribution of any subset of components of a Gaussian vector is Gaussian.
(4) Closure under conditioning:
Conditioning a Gaussian vector on a subset of its components produces a Gaussian distribution for the remaining components. The conditional mean is a linear function of the observed values, and the conditional covariance matrix reflects the reduction in uncertainty due to the conditioning.
These properties make Gaussian distributions analytically tractable and widely applicable in fields such as statistics, machine learning, and signal processing.
2. Linear Model¶
Consider the linear system:
$$ y = A x + \omega, $$
where:
- $x$ is the unknown signal or state vector we aim to estimate
- $\omega$ represents additive noise or disturbance
- $A$ is the known measurement or system matrix
The vectors $x$ and $\omega$ are assumed to be independent. The problem is to estimate $x$ given an observation $y = y_{\text{meas}}$.
The joint distribution of $x$ and $y$ can be expressed as:
$$ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & 0 \\ A & I \end{bmatrix} \begin{bmatrix} x \\ \omega \end{bmatrix}. $$
Our goal is to compute the conditional distribution $p(x \mid y = y_{\text{meas}})$ and derive the best estimate of $x$ based on $y_{\text{meas}}$.
2.1. Linear Measurements with Gaussian Noise¶
Assume:
$$ x \sim \mathcal{N}(0, \Sigma_x), \quad \omega \sim \mathcal{N}(0, \Sigma_\omega). $$
Thus, $x$ and $\omega$ are both Gaussian and independent, so the joint distribution of $\begin{bmatrix} x \\ \omega \end{bmatrix}$ is Gaussian.
Using the linear transformation above, $\begin{bmatrix} x \\ y \end{bmatrix}$ is also Gaussian. Its mean vector is:
$$ \mathbb{E}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, $$
and its covariance matrix is:
$$ \operatorname{cov}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \Sigma_x & \Sigma_x A^T \\ A \Sigma_x & A \Sigma_x A^T + \Sigma_\omega \end{bmatrix}. $$
2.1.1. Conditional Distribution and MMSE Estimate¶
Using Gaussian conditioning results, the conditional distribution of $x$ given $y = y_{\text{meas}}$ is Gaussian:
$$ x \mid y = y_{\text{meas}} \sim \mathcal{N}\left( \hat{x}_{\text{mmse}}, \operatorname{cov}(x \mid y) \right), $$
where:
$$ \hat{x}_{\text{mmse}} = \mathbb{E}[x \mid y = y_{\text{meas}}] = \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1} y_{\text{meas}}, $$
$$ \operatorname{cov}(x \mid y) = \Sigma_x - \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1} A \Sigma_x < \Sigma_x. $$
These expressions define the Minimum Mean Square Error (MMSE) estimate of $x$. The posterior covariance quantifies the residual uncertainty in $x$ after observing $y$.
Note
If $x \sim \mathcal{N}(0, \Sigma)$, the conditional distribution of $x_1$ given $x_2 = y$ is also Gaussian:
$$ \begin{align*} \mathbb{E}[x_1 \mid x_2 = y] &= \Sigma_{12} \Sigma_{22}^{-1} y,\\\\ \operatorname{cov}(x_1 \mid x_2 = y) &= \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} < \Sigma_{11}. \end{align*} $$
Example: linear measurements with Gaussian noise
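A minimal numerical sketch of such an example (the particular $A$, $\Sigma_x$, and $\Sigma_\omega$ below are chosen for illustration, not taken from the original lecture):

import numpy as np

np.random.seed(1)
Sigma_x = np.diag([2.0, 1.0])          # prior covariance of x
Sigma_w = 0.1 * np.eye(3)              # measurement noise covariance
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

x_true = np.random.multivariate_normal(np.zeros(2), Sigma_x)
y_meas = A @ x_true + np.random.multivariate_normal(np.zeros(3), Sigma_w)

# MMSE estimate and posterior covariance from the conditioning formulas
G = A @ Sigma_x @ A.T + Sigma_w
x_hat = Sigma_x @ A.T @ np.linalg.solve(G, y_meas)
cov_post = Sigma_x - Sigma_x @ A.T @ np.linalg.solve(G, A @ Sigma_x)

print(x_true, x_hat)
print(cov_post)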
2.1.2. Signal-to-Noise Ratio (SNR)¶
As derived in the previous subsection, the conditional distribution of $x$ given $y = y_{\text{meas}}$ is Gaussian with mean $\hat{x}_{\text{mmse}} = \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1} y_{\text{meas}}$ and covariance $\operatorname{cov}(x \mid y) = \Sigma_x - \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1} A \Sigma_x$.
The inequality $\operatorname{cov}(x \mid y) < \Sigma_x$ indicates that conditioning on $y$ reduces the uncertainty in $x$.
The quality of the MMSE estimate $\hat{x}_{\text{mmse}}$ depends critically on the relative magnitudes of signal and noise, represented by $\Sigma_x$ and $\Sigma_\omega$:
- $\Sigma_x$ describes the prior variance (or energy) of the signal $x$.
- $\Sigma_\omega$ represents the variance (or energy) of the measurement noise $\omega$.
The ratio of signal energy to noise energy defines the Signal-to-Noise Ratio (SNR) of the system:
- A high signal energy (large $\Sigma_x$) and low noise energy (small $\Sigma_\omega$) yield a higher SNR, resulting in a more accurate and confident estimate of $x$.
- Conversely, a low SNR degrades the estimation quality, increasing the uncertainty in $\hat{x}_{\text{mmse}}$.
A key quantity in this context is the gain matrix:
$$ L = \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1}. $$
This matrix acts as a filter, weighting the measurement $y$ to produce the optimal estimate $\hat{x}_{\text{mmse}}$. The structure of $L$ encapsulates the influence of both the prior knowledge about $x$ (through $\Sigma_x$) and the measurement noise characteristics (through $\Sigma_\omega$).
In summary, the SNR determines the confidence and precision of the MMSE estimate.
- A higher SNR results in a gain matrix $L$ that places greater weight on the measurements, improving estimation accuracy.
- A lower SNR leads to a smaller gain matrix, reducing the impact of $y$ and reflecting increased uncertainty.
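For intuition, consider the scalar case with $A = 1$, where the gain reduces to $L = \sigma_x^2 / (\sigma_x^2 + \sigma_\omega^2)$. The short computation below (illustrative values) shows how the gain tracks the SNR:

sigma_x2 = 1.0                       # prior (signal) variance

for sigma_w2 in [0.01, 1.0, 100.0]:  # low, moderate, high noise
    L = sigma_x2 / (sigma_x2 + sigma_w2)
    print(f"noise variance = {sigma_w2:6.2f}  ->  gain L = {L:.3f}")
# high SNR  -> L close to 1 (trust the measurement)
# low SNR   -> L close to 0 (fall back on the prior mean)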
2.1.3. Posterior Covariance: Matrix Form¶
The posterior covariance of $x$ given $y$ is
$$ \operatorname{cov}(x \mid y) = \Sigma_x - \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1} A \Sigma_x. $$
Introducing the gain matrix:
$$ L = \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1}, $$
the posterior covariance can be rewritten as:
$$ \operatorname{cov}(x \mid y) = \Sigma_x - L A \Sigma_x. $$
Observe that:
$$ \Sigma_x - L A \Sigma_x = (I - L A) \Sigma_x. $$
This alternative form highlights the filtering effect of the gain matrix $L$ and the system matrix $A$:
- The matrix $(I - L A)$ scales the prior covariance $\Sigma_x$, accounting for the information gained from observing $y$.
- When $L A$ is close to the identity matrix (high signal-to-noise ratio), the posterior covariance shrinks significantly, indicating a confident estimate.
- When $L A$ is close to zero (low signal-to-noise ratio), the posterior covariance remains close to the prior, reflecting limited information gain.
This matrix form is particularly useful in control and filtering applications, such as the Kalman filter, where the gain matrix $L$ plays a central role in updating both the state estimate and its uncertainty.
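A quick numerical check of this factored form against the subtraction form, using arbitrary matrices (a sketch, not part of the original derivation):

import numpy as np

np.random.seed(2)
Sigma_x = np.diag([2.0, 0.5])
Sigma_w = 0.2 * np.eye(3)
A = np.random.randn(3, 2)

L = Sigma_x @ A.T @ np.linalg.inv(A @ Sigma_x @ A.T + Sigma_w)

cov1 = Sigma_x - L @ A @ Sigma_x        # subtraction form
cov2 = (np.eye(2) - L @ A) @ Sigma_x    # filtered form (I - L A) Sigma_x

print(np.allclose(cov1, cov2))          # True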
2.1.4. Posterior Covariance and Gain Matrix Identities¶
The posterior covariance of $x$ given $y$ has a closed-form expression that can be derived using properties of Gaussian conditioning and matrix inversion. Specifically, we start from the expression:
$$ \operatorname{cov}(x \mid y) = \Sigma_x - \Sigma_x A^T (A \Sigma_x A^T + \Sigma_\omega)^{-1} A \Sigma_x. $$
This formula shows that the posterior covariance is obtained by subtracting a positive semi-definite matrix (the reduction in uncertainty due to conditioning on $y$) from the prior covariance $\Sigma_x$. However, an alternative and often more convenient expression for the posterior covariance is:
$$ \operatorname{cov}(x \mid y) = \left( \Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A \right)^{-1}. $$
Derivation of the Matrix Identity
The derivation of this identity relies on the Woodbury matrix inversion lemma, which states:
$$ (U + V C V^T)^{-1} = U^{-1} - U^{-1} V (C^{-1} + V^T U^{-1} V)^{-1} V^T U^{-1}, $$
where $U$ and $C$ are invertible matrices.
In our case:
$$ \begin{align*} U &= \Sigma_\omega \\ V &= A \\ C &= \Sigma_x \end{align*} $$
Applying the lemma to the inverse of $A \Sigma_x A^T + \Sigma_\omega$ yields:
$$ (A \Sigma_x A^T + \Sigma_\omega)^{-1} = \Sigma_\omega^{-1} - \Sigma_\omega^{-1} A (\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A)^{-1} A^T \Sigma_\omega^{-1}. $$
Substituting this into the original expression for $\operatorname{cov}(x \mid y)$, we obtain:
$$ \begin{aligned} \operatorname{cov}(x \mid y) &= \Sigma_x - \Sigma_x A^T \left[\Sigma_\omega^{-1} - \Sigma_\omega^{-1} A (\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A)^{-1} A^T \Sigma_\omega^{-1}\right] A \Sigma_x \\ &= (\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A)^{-1}. \end{aligned} $$
This compact expression reveals that the posterior covariance can be viewed as the inverse of the sum of information matrices:
- $\Sigma_x^{-1}$ represents the information (inverse covariance) from the prior distribution of $x$
- $A^T \Sigma_\omega^{-1} A$ represents the information contribution from the measurements $y$
The posterior covariance thus combines prior information and measurement information in an elegant additive form.
The gain matrix $L$, which determines how $y$ is weighted to estimate $x$, can also be expressed in a compact form:
$$ L = (\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A)^{-1} A^T \Sigma_\omega^{-1}. $$
This expression shows that:
- $A^T \Sigma_\omega^{-1}$ maps the measurement $y$ into the space of $x$, weighting it according to the noise covariance $\Sigma_\omega$
- $(\Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A)^{-1}$ incorporates the combined prior and measurement information, balancing their contributions
These matrix identities are not only algebraically elegant but also have clear probabilistic interpretations:
- The posterior covariance reflects the combined precision (inverse variance) of prior and measurement information
- The gain matrix $L$ balances the influence of prior information and noisy measurements, optimizing the estimation of $x$
These expressions are fundamental in linear estimation theory and form the mathematical backbone of Kalman filtering, Bayesian linear regression, and Gaussian process regression.
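Both identities are easy to verify numerically; the following sketch uses arbitrary matrices purely as a sanity check:

import numpy as np

np.random.seed(3)
Sigma_x = np.diag([2.0, 0.5])
Sigma_w = 0.2 * np.eye(3)
A = np.random.randn(3, 2)

G = A @ Sigma_x @ A.T + Sigma_w

# posterior covariance: direct form vs. information form
cov_direct = Sigma_x - Sigma_x @ A.T @ np.linalg.solve(G, A @ Sigma_x)
cov_info = np.linalg.inv(np.linalg.inv(Sigma_x) + A.T @ np.linalg.inv(Sigma_w) @ A)

# gain matrix: direct form vs. information form
L_direct = Sigma_x @ A.T @ np.linalg.inv(G)
L_info = cov_info @ A.T @ np.linalg.inv(Sigma_w)

print(np.allclose(cov_direct, cov_info))  # True
print(np.allclose(L_direct, L_info))      # True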
Note
Recall the classic two-sensor fusion problem, where two independent measurements $a$ and $b$ of a scalar quantity $x$ are combined optimally. If:
$$ a \sim \mathcal{N}(x, \sigma_a^2), \quad b \sim \mathcal{N}(x, \sigma_b^2), $$
then the variance of the optimal estimate $\hat{x}$ (which minimizes mean squared error) is given by:
$$ \operatorname{var}[\hat{x}] = \frac{1}{\frac{1}{\sigma_a^2} + \frac{1}{\sigma_b^2}}. $$
This elegant formula illustrates how information from two noisy sensors is combined: the precisions ($1/\sigma^2$) are additive, and the overall uncertainty is reduced.
This result is directly connected to the matrix identity for $\operatorname{cov}(x \mid y)$:
$$ \operatorname{cov}(x \mid y) = \left( \Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A \right)^{-1}. $$
In the scalar case, $\Sigma_x = \sigma_a^2$, $\Sigma_\omega = \sigma_b^2$, and $A = 1$.
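For instance, with $\sigma_a^2 = 1$ and $\sigma_b^2 = 4$ (illustrative values):

sigma_a2, sigma_b2 = 1.0, 4.0

var_fused = 1.0 / (1.0 / sigma_a2 + 1.0 / sigma_b2)
print(var_fused)   # 0.8, smaller than either individual variance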
2.2 MMSE and Least-Squares Estimation¶
When the prior and the noise are Gaussian, the posterior distribution of $x$ is Gaussian, so its mean and mode coincide. Consequently, the Minimum Mean Square Error (MMSE) estimator, which is the posterior mean, equals the Maximum a Posteriori (MAP) estimate, which is the posterior mode.
In contrast, the Least-Squares (LS) approach focuses solely on minimizing the squared error between observed and predicted measurements, without incorporating prior information. Specifically, LS minimizes:
$$ \| y - A x \|^2 = \sum_{i=1}^m (y_i - a_i^T x)^2, $$
where $A$ is the measurement matrix with rows $a_i^T$.
In the absence of prior information about $x$ (i.e., assuming $\Sigma_x \to \infty$, which corresponds to a non-informative prior), the MMSE estimate reduces to the LS solution. This highlights how Bayesian estimation generalizes LS by incorporating prior beliefs.
2.3 Weighted Norms and Weighted Least-Squares¶
In practice, measurement noise is rarely uniform across all measurements. To account for varying levels of noise, we introduce weights into the estimation process. Instead of minimizing the unweighted residual, we consider:
$$ \sum_{i=1}^m \omega_i (y_i - a_i^T x)^2, $$
where $\omega_i$ are positive weights reflecting the precision (inverse variance) of each measurement (note that these $\omega_i$ are distinct from the noise $\omega$ in the linear model discussed earlier).
Equivalently, we can write this using a positive-definite weight matrix $W$:
$$ \| y_{\text{meas}} - A x \|_W^2 = (y_{\text{meas}} - A x)^T W (y_{\text{meas}} - A x), $$
where $W$ encodes the relative confidence in different measurements.
2.3.1 Weighted Least-Squares (WLS)¶
The Weighted Least-Squares (WLS) problem seeks to minimize the weighted residual:
$$ \text{minimize} \quad \| y_{\text{meas}} - A x \|_W^2. $$
The optimal solution is:
$$ x_{\text{wls}} = (A^T W A)^{-1} A^T W y_{\text{meas}}. $$
This solution geometrically corresponds to the closest point in the range of $A$ to $y_{\text{meas}}$, measured in the weighted norm defined by $W$.
2.3.2 Connection Between MMSE and WLS¶
A key insight emerges when we set the weight matrix $W$ equal to the inverse of the noise covariance:
$$ W = \Sigma_\omega^{-1}. $$
In this case, the WLS solution becomes:
$$ x_{\text{wls}} = (A^T \Sigma_\omega^{-1} A)^{-1} A^T \Sigma_\omega^{-1} y_{\text{meas}}. $$
The MMSE estimate is:
$$ x_{\text{mmse}} = \left( \Sigma_x^{-1} + A^T \Sigma_\omega^{-1} A \right)^{-1} A^T \Sigma_\omega^{-1} y_{\text{meas}}. $$
Comparing the two expressions:
- As $\Sigma_x \to \infty$ (uninformative prior), $\Sigma_x^{-1} \to 0$, and the MMSE estimate approaches the WLS solution:
$$ x_{\text{mmse}} \to x_{\text{wls}}. $$
- If $\Sigma_\omega = I$, then $W = I$ and WLS reduces to standard LS:
$$ x_{\text{wls}} = (A^T A)^{-1} A^T y_{\text{meas}}. $$
In this case, the MMSE estimate also reduces to LS as $\Sigma_x \to \infty$.
Interpretation of Weighted Norms
Weighted norms assign higher penalties to residuals in directions with low measurement noise (high precision). Specifically:
- Directions in which the measurement noise is low (small variances in $\Sigma_\omega$) are weighted more heavily in $W = \Sigma_\omega^{-1}$.
- This prioritizes minimizing errors in high-precision directions, resulting in more accurate and reliable estimates.
In practical terms, WLS provides a natural way to handle heteroscedastic noise (i.e., measurements with varying levels of uncertainty) by appropriately weighting each residual.
2.4. Example: Comparing LS, WLS, and MMSE Estimation¶
Consider a simple measurement model:
$$ y = A x + \omega, $$
where:
- $x$ is a scalar (for simplicity),
- $A$ is a $2 \times 1$ matrix:
$$ A = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, $$
- The true value of $x$ is unknown, and we observe $y_1$ and $y_2$:
$$ y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}. $$
- Measurement noise covariance:
$$ \Sigma_\omega = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}. $$
Assume:
$$ \sigma_1^2 = 1, \quad \sigma_2^2 = 4, $$
which means $y_1$ is more reliable than $y_2$.
2.4.1. Least-Squares (LS) Estimate¶
Without weighting (assuming $\Sigma_\omega = I$), the LS solution minimizes:
$$ \| y - A x \|^2 = (y_1 - x)^2 + (y_2 - x)^2. $$
The optimal estimate is:
$$ x_{\text{ls}} = \frac{y_1 + y_2}{2}. $$
2.4.2. Weighted Least-Squares (WLS) Estimate¶
Set $W = \Sigma_\omega^{-1}$:
$$ W = \begin{bmatrix} 1 & 0 \\ 0 & 0.25 \end{bmatrix}. $$
The WLS objective:
$$ \| y - A x \|_W^2 = (y_1 - x)^2 + 0.25(y_2 - x)^2. $$
The optimal estimate:
$$ x_{\text{wls}} = \frac{1 \cdot y_1 + 0.25 \cdot y_2}{1 + 0.25} = \frac{y_1 + 0.25 y_2}{1.25}. $$
This places more weight on $y_1$ because its noise variance is lower.
2.4.3. MMSE Estimate with Prior on $x$¶
Now, assume a prior:
$$ x \sim \mathcal{N}(0, \sigma_x^2), \quad \sigma_x^2 = 10. $$
Using:
$$ x_{\text{mmse}} = \left(\sigma_x^{-2} + A^T \Sigma_\omega^{-1} A\right)^{-1} A^T \Sigma_\omega^{-1} y, $$
where:
$$ A^T \Sigma_\omega^{-1} A = 1 + 0.25 = 1.25. $$
Thus:
$$ x_{\text{mmse}} = \left(\frac{1}{10} + 1.25\right)^{-1} \left(1 \cdot y_1 + 0.25 \cdot y_2\right) = \frac{1}{1.35} \cdot (y_1 + 0.25 y_2). $$
Note that:
- As $\sigma_x^2 \to \infty$ (uninformative prior), $x_{\text{mmse}} \to x_{\text{wls}}$.
- If $\Sigma_\omega = I$, $x_{\text{mmse}}$ reduces to LS as $\sigma_x^2 \to \infty$.
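The three estimates can be compared numerically; the sketch below simulates measurements assuming a true value $x = 1$ (an arbitrary choice for illustration) and applies the closed-form expressions above:

import numpy as np

np.random.seed(4)
x_true = 1.0
A = np.array([[1.0], [1.0]])
Sigma_w = np.diag([1.0, 4.0])
sigma_x2 = 10.0

# simulate the two noisy measurements
y = A.flatten() * x_true + np.random.multivariate_normal(np.zeros(2), Sigma_w)
y1, y2 = y

x_ls = (y1 + y2) / 2                              # unweighted average
x_wls = (y1 + 0.25 * y2) / 1.25                   # weights 1 and 1/4
x_mmse = (y1 + 0.25 * y2) / (1 / sigma_x2 + 1.25) # prior shrinks the estimate toward 0

print(x_ls, x_wls, x_mmse)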
3. Gaussian Process Regression¶
3.1 Bayesian Linear Regression: Motivation¶
In regression problems, we model the relationship between inputs $x \in \mathbb{R}^d$ and outputs $y \in \mathbb{R}$ using a linear model:
$$ y = \omega^T x + \varepsilon $$
where $\omega$ is the weight vector and $\varepsilon$ represents observation noise.
Maximum Likelihood Estimation (MLE) seeks the value of $\omega$ that maximizes the likelihood of the observed data. However, MLE often overfits when data is sparse or noisy, yielding only a point estimate and failing to quantify prediction uncertainty.
A more robust alternative is Maximum a Posteriori (MAP) estimation, which incorporates a Gaussian prior:
$$ \omega \sim \mathcal{N}(0, \gamma I) $$
While MAP regularizes the estimate and mitigates overfitting, it still produces only a point estimate of $\omega$ and lacks a direct representation of prediction uncertainty.
The Bayesian approach instead computes the posterior distribution of $\omega$ given the data $D$:
$$ p(\omega \mid D) $$
which allows us to integrate over $\omega$ to obtain the predictive distribution for a new input $x$:
$$ p(y \mid x,D) = \int p(y \mid x,\omega) p(\omega \mid D) d\omega $$
This framework naturally quantifies uncertainty from both data noise and parameter variability.
Bayesian regression can be viewed from two complementary perspectives:
- Weight-space view: focuses on the posterior distribution over $\omega$.
- Function-space view: focuses on the distribution over functions $f(x) = \omega^T x$.
In the next section, we transition to the function-space view, which leads to Gaussian Process (GP) regression.
Weight-Space View
In the weight-space view, we model the relationship:
$$ y = \omega^T x + \varepsilon $$
where:
- $\omega \sim \mathcal{N}(0, \gamma I)$ is a prior over weights,
- $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ is measurement noise.
(1) Left Plots: Posterior Distribution in Weight Space
Circles and ellipses in the plots represent contours of constant probability density for the Gaussian distributions in weight space.
The large circles correspond to the prior distribution of $\omega$, which is assumed isotropic ($\omega \sim \mathcal{N}(0, \gamma I)$), reflecting initial uncertainty.
The smaller ellipses correspond to the posterior distribution of $\omega$ after observing data points: each new observation shrinks and reorients the ellipse.
(2) Right Plots: Functions from Weight Samples
Each red line corresponds to a linear function $f(x) = \omega^T x$ drawn from the posterior distribution over $\omega$.
Key Concepts Illustrated:
In the weight-space view, we start by placing a prior distribution on $\omega$.
Observations $D$ update this to the posterior distribution of $\omega$, reflected by the shrinking and reorienting ellipses.
The resulting function predictions come from sampling $\omega$ and plotting $f(x) = \omega^T x$ for each sample.
The shaded predictive regions in function space are projections of the weight posterior, summarizing uncertainty.
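The sketch below reproduces the weight-space computation for a line with an intercept, $y = \omega_0 + \omega_1 x + \varepsilon$ (an illustrative extension in which the input is augmented to $(1, x)$), using assumed values for $\gamma$ and $\sigma^2$ and simulated data; it is not the code that generated the original figures:

import numpy as np

np.random.seed(5)
gamma, sigma2 = 1.0, 0.1          # prior variance and noise variance (assumed)

# simulated data from a "true" line
w_true = np.array([-0.3, 0.8])
x_obs = np.random.uniform(-1, 1, size=10)
Phi = np.column_stack([np.ones_like(x_obs), x_obs])      # features [1, x]
y_obs = Phi @ w_true + np.sqrt(sigma2) * np.random.randn(10)

# Gaussian posterior over the weights
S_post = np.linalg.inv(np.eye(2) / gamma + Phi.T @ Phi / sigma2)
m_post = S_post @ Phi.T @ y_obs / sigma2

# sample a few weight vectors; each gives a line f(x) = w0 + w1 x
w_samples = np.random.multivariate_normal(m_post, S_post, size=5)
print(m_post)
print(w_samples)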
Function-Space View
The function-space view is discussed in the next section.
3.2 Gaussian Process Regression¶
With the background covered so far, we are now ready to discuss Gaussian process regression and its core concepts.
I recommend watching the following video lecture by Prof. Marc Deisenroth (currently the Google DeepMind Chair of Machine Learning and Artificial Intelligence at University College London).
This lecture is one of the best resources on Gaussian Processes, providing clear visual illustrations and deep insights from two-variable Gaussian models to multivariate Gaussian distributions, and ultimately to infinite-dimensional Gaussian processes.
from IPython.display import YouTubeVideo
YouTubeVideo('92-98SYOdlY', width = "560", height = "315")
Consider a dataset:
$$ D = \left\{ (x_1, y_1), \dots, (x_n, y_n) \right\}, \quad x_i \in \mathbb{R}^d, \; y_i \in \mathbb{R} $$
The model assumes:
$$ p(y_i \mid x_i, \omega) = \mathcal{N}(y_i \mid \omega^T x_i, \sigma^2), \quad \omega \sim \mathcal{N}(0, \gamma I) $$
Thus, each observation satisfies:
$$ y_i = \omega^T x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) $$
For any $x \in \mathbb{R}^d$, the projection $\omega^T x$ is a univariate Gaussian:
$$ Z_x = x^T \omega \sim \mathcal{N}(0, \gamma x^T x) $$
Thus, $\{Z_x\}_{x \in \mathbb{R}^d}$ defines a Gaussian process (GP):
$$ \mu(x) = \mathbb{E}[Z_x] = 0 $$
$$ k(x,x') = \operatorname{cov}(Z_x, Z_{x'}) = \mathbb{E}[Z_x Z_{x'}] - \mathbb{E}[Z_x]\,\mathbb{E}[Z_{x'}] = \mathbb{E}\left[x^T \omega\, \omega^T x'\right] = x^T\, \mathbb{E}\left[\omega \omega^T\right] x' = \gamma\, x^T x' $$
This process defines a distribution over functions $f(x) = Z_x$, where the kernel function $k(x,x')$ captures the similarity structure between points.
With a nonlinear basis transformation $\varphi(x)$ (e.g., polynomial or RBF features), we generalize:
$$ Z_x = \omega^T \varphi(x), \quad k(x,x') = \gamma \varphi(x)^T \varphi(x') $$
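For instance, with a simple polynomial feature map $\varphi(x) = (1, x, x^2)$ (an illustrative choice), the induced kernel is computed directly from the features:

import numpy as np

gamma = 0.5

def phi(x):
    # simple polynomial feature map (illustrative choice)
    return np.array([1.0, x, x ** 2])

def k(x, xp):
    # induced kernel k(x, x') = gamma * phi(x)^T phi(x')
    return gamma * phi(x) @ phi(xp)

print(k(0.3, -1.2))   # covariance between Z_x and Z_x' under w ~ N(0, gamma * I)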
3.2.1. Key Step in GP Regression¶
The key insight in GP regression is modeling:
$$ y_i = Z_{x_i} + \varepsilon_i $$
Define:
- $z \sim \mathcal{N}(\mu, K) \in \mathbb{R}^n$ representing latent function values $Z_{x_i}$
- $\varepsilon \sim \mathcal{N}(0, \sigma^2 I) \in \mathbb{R}^n$ as independent observation noise
Then:
$$ y = z + \varepsilon \sim \mathcal{N}(\mu, C) $$
where:
$$ \mu = \begin{bmatrix} \mu(x_1) \\ \vdots \\ \mu(x_n) \end{bmatrix}, \quad K_{ij} = k(x_i, x_j), \quad C = K + \sigma^2 I $$
3.2.2. Conditional Prediction¶
Partition the indices into:
$$ a = \{1,\dots,l\}, \quad b = \{l+1,\dots,n\} $$
where $b$ indexes the observed (training) points and $a$ indexes the test points at which we want predictions:
$$ y = \begin{bmatrix} y_a \\ y_b \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix} $$
$$ C = \begin{bmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb} \end{bmatrix}, \quad K = \begin{bmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{bmatrix} $$
The conditional distribution of $y_a$ given $y_b$ is Gaussian:
$$ y_a \mid y_b \sim \mathcal{N}(m, D) $$
where:
$$ \begin{aligned} m &= \mu_a + C_{ab} C_{bb}^{-1}(y_b - \mu_b) \\ D &= C_{aa} - C_{ab} C_{bb}^{-1} C_{ba} \end{aligned} $$
Substituting $C = K + \sigma^2 I$:
$$ \begin{aligned} m &= \mu_a + K_{ab}(K_{bb} + \sigma^2 I)^{-1}(y_b - \mu_b) \\ D &= K_{aa} + \sigma^2 I - K_{ab}(K_{bb} + \sigma^2 I)^{-1} K_{ba} \end{aligned} $$
This formulation yields the predictive distribution for test inputs, incorporating both the latent function correlations and the observation noise.
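The predictive equations translate directly into a few lines of numpy; the sketch below uses the squared exponential kernel with assumed hyperparameters and a zero mean function (Section 4 develops a fuller example):

import numpy as np

def sqexp(X1, X2, theta1=1.0, theta2=5.0):
    # squared exponential kernel k(x, x') = theta1 * exp(-theta2/2 * (x - x')^2)
    return theta1 * np.exp(-theta2 / 2 * np.subtract.outer(X1, X2) ** 2)

def gp_predict(Xa, Xb, yb, sigma2=0.1, mu_a=0.0, mu_b=0.0):
    # predictive distribution of y_a given observed y_b
    Kab = sqexp(Xa, Xb)
    Kbb = sqexp(Xb, Xb)
    Kaa = sqexp(Xa, Xa)
    Cbb = Kbb + sigma2 * np.eye(len(Xb))
    m = mu_a + Kab @ np.linalg.solve(Cbb, yb - mu_b)
    D = Kaa + sigma2 * np.eye(len(Xa)) - Kab @ np.linalg.solve(Cbb, Kab.T)
    return m, D

Xb = np.array([-1.0, 0.0, 1.5])          # training inputs
yb = np.sin(Xb)                          # training targets
Xa = np.array([0.5, 2.0])                # test inputs
m, D = gp_predict(Xa, Xb, yb)
print(m)
print(np.diag(D))                        # predictive variances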
Function-Space View
The function-space view directly considers the distribution over functions:
$$ f(x) = \omega^T x $$
and generalizes to nonlinear transformations:
$$ f(x) = \omega^T \varphi(x) $$
where $\varphi(x)$ is a feature map (e.g., polynomial, RBF features).
In this view, $f(x)$ is modeled as a Gaussian process (GP):
$$ f(x) \sim GP(0, k(x,x')) $$
where the kernel $k(x,x')$ encodes the similarity between inputs:
$$ k(x,x') = \gamma \varphi(x)^T \varphi(x') $$
Instead of explicitly working with weights $\omega$, GP regression infers the distribution of function values $f(x)$ at all inputs, directly from the observed data.
This approach:
- Avoids explicit basis functions.
- Enables nonparametric modeling (the complexity of the model grows with data).
- Naturally quantifies uncertainty in predictions.
4. Gaussian Process Regression with Python¶
This section presents a practical example of Gaussian Process (GP) regression using Python. The example and code are adapted from Domino Data Lab’s blog.
A Gaussian process can be viewed as a distribution over functions. Just as a multivariate Gaussian distribution is fully characterized by its mean vector and covariance matrix, a Gaussian process is fully determined by a mean function and a covariance function:
$$ f(x) \sim \mathcal{GP} \bigl(m(x), k(x,x')\bigr), $$
where:
- $m(x) = \mathbb{E}[f(x)]$ is the mean function,
- $k(x,x') = \operatorname{cov}(f(x), f(x'))$ is the covariance function (or kernel).
For example, a commonly used GP specification is:
$$ \begin{aligned} m(x) &= 0, \\ k(x,x') &= \theta_1 \exp\left(-\frac{\theta_2}{2}(x - x')^2\right). \end{aligned} $$
This squared exponential kernel implies that function values at points $x$ and $x'$ that are close together tend to be highly correlated (correlation close to 1), while points far apart are nearly uncorrelated ($k(x,x')$ close to 0).
This section will demonstrate:
- How to implement GP regression in Python,
- How to visualize the prior and posterior distributions of functions,
- How the kernel function encodes assumptions about smoothness and correlation.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
$$k(x,x') = \theta_1 \exp\left( -\frac{\theta_2}{2} \left(x-x' \right)^2 \right)$$
def exponential_cov(x, y, params):
    return params[0] * np.exp(-params[1] / 2 * np.subtract.outer(x, y) ** 2)
$$p(x \mid y) = \mathcal{N}\left( \mu_x + \Sigma_{xy} \Sigma^{-1}_{y} \,(y-\mu_y),\; \Sigma_{x} - \Sigma_{xy}\Sigma^{-1}_{y}\Sigma_{xy}^T \right)$$
def conditional(x_new, x, y, params):
    # GP conditional: distribution at x_new given observations (x, y)
    B = exponential_cov(x_new, x, params)      # cross-covariance K(x_new, x)
    C = exponential_cov(x, x, params)          # covariance of observed inputs K(x, x)
    A = exponential_cov(x_new, x_new, params)  # prior covariance at x_new
    mu = np.linalg.inv(C).dot(B.T).T.dot(y)              # conditional mean B C^{-1} y
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))         # conditional covariance
    return mu.squeeze(), sigma.squeeze()
theta = [1, 5]
sigma = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step = 0.01)
plt.figure(figsize = (6, 4))
plt.errorbar(xpts, np.zeros(len(xpts)), yerr = sigma, capsize = 0, alpha = 0.2)
plt.show()
# first observation
x = [1.]
y = [np.random.normal(scale = sigma)]
y
sigma_1 = exponential_cov(x, x, theta)
def predict(x, data, kernel, params, sigma, t):
    # posterior mean and variance at a single test point x,
    # given observed inputs `data`, their covariance `sigma`, and targets `t`
    k = [kernel(x, y, params) for y in data]
    Sinv = np.linalg.inv(sigma)
    y_pred = np.dot(k, Sinv).dot(t)
    sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
    return y_pred, sigma_new
x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.figure(figsize = (6, 4))
plt.errorbar(x_pred, y_pred, yerr = sigmas, capsize = 0, alpha = 0.1)
plt.plot(x, y, "ro")
plt.show()
# the second observation
m, s = conditional([-0.7], x, y, theta)
y2 = np.random.normal(m, np.sqrt(s))  # s is a variance; np.random.normal expects a standard deviation
y2
x.append(-0.7)
y.append(y2)
sigma_2 = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_2, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.figure(figsize = (6, 4))
plt.errorbar(x_pred, y_pred, yerr = sigmas, capsize = 0, alpha = 0.2)
plt.plot(x, y, "ro")
plt.show()
# more observations
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, theta)
y_more = np.random.multivariate_normal(mu, s)
y_more
x += x_more
y += y_more.tolist()
sigma_new = exponential_cov(x, x, theta)
predictions = [predict(i, x, exponential_cov, theta, sigma_new, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.figure(figsize = (6, 4))
plt.errorbar(x_pred, y_pred, yerr = sigmas, capsize = 0, alpha = 0.2)
plt.plot(x, y, "ro")
plt.show()
There are a number of libraries available for specifying and fitting GP models in a more automated way.
- scikit-learn (we will explore this)
- GPflow
- PyMC3
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
xp = np.linspace(0,10,100)
yp = np.sin(xp)
plt.figure(figsize = (6, 4))
plt.plot(xp, yp)
plt.show()
x = np.array([0.8, 1, 1.2, 3, 5, 6, 8, 8.8, 9, 9.5])
r = np.sin(x)
y = r + 0.1*np.random.randn(10)
plt.figure(figsize = (6, 4))
plt.plot(xp, yp)
plt.plot(x, y, '.')
plt.show()
scikit-learn offers a library of kernels to choose from. A flexible choice to start with is the Matern covariance.
kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)
X = x.reshape(-1,1)
gp = GaussianProcessRegressor(kernel = kernel)
gp.fit(X,y)
gp.kernel_
x_pred = np.linspace(0, 10, 100).reshape(-1,1)
y_pred, sigma = gp.predict(x_pred, return_std=True)
plt.figure(figsize = (6, 4))
plt.plot(x_pred, y_pred, label = 'Mean prediction')
plt.fill_between(x_pred.ravel(), y_pred - 1.96 * sigma, y_pred + 1.96 * sigma, alpha = 0.2, label = '95% CI')
plt.plot(X, y, '.', markersize = 5, label = 'Observations')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.title('GP with Data and Confidence Interval')
plt.legend()
plt.show()
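We can also draw sample functions from the fitted posterior with `sample_y`; the short sketch below continues the example above:

# draw a few functions from the GP posterior and plot them with the mean
y_samples = gp.sample_y(x_pred, n_samples=3, random_state=0)

plt.figure(figsize = (6, 4))
plt.plot(x_pred, y_pred, 'k', label = 'Mean prediction')
plt.plot(x_pred.ravel(), y_samples, alpha = 0.6)
plt.plot(X, y, '.', markersize = 5, label = 'Observations')
plt.legend()
plt.show()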
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')