What is a Gaussian Process?
A Gaussian process (GP) is a probabilistic model that treats the function itself as a random variable. Intuitively, you can think of it as a multivariate normal distribution extended to infinite dimensions.
While an ordinary normal distribution defines a distribution over a vector \(\mathbf{y} \in \mathbb{R}^n\) , a GP defines a distribution over a function \(f: \mathcal{X} \rightarrow \mathbb{R}\) . The formal definition: \(f\) is a Gaussian process if, for any finite set of inputs \(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) , the corresponding vector of function values \([f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^T\) follows a multivariate Gaussian distribution.
A GP is fully specified by a mean function \(m(\mathbf{x})\) and a kernel (covariance) function \(k(\mathbf{x}, \mathbf{x}')\) :
\[ f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) \tag{1} \]In practice \(m(\mathbf{x}) = 0\) is a common choice; the function shape is controlled by the kernel.
Kernel Functions (Designing the Prior)
The kernel \(k(\mathbf{x}, \mathbf{x}')\) specifies the covariance of function values at two inputs and encodes prior beliefs about smoothness, periodicity, and so on.
RBF (Squared Exponential) Kernel
The most popular choice is the RBF kernel, also called the squared exponential kernel:
\[ k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 \ell^2}\right) \tag{2} \]
- \(\sigma_f^2\) : output variance (function amplitude)
- \(\ell\) : length scale, controlling how quickly correlation decays with distance
A larger \(\ell\) produces smoother functions; a smaller \(\ell\) allows rapid local variation. RBF samples are infinitely differentiable.
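To make the length-scale intuition concrete, here is a minimal sketch (NumPy only) that draws sample functions from a zero-mean GP prior with the RBF kernel of Eq. (2); the helper name rbf_kernel, the grid, and the jitter value are illustrative choices, not part of the implementation later in this article.
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Eq. (2): squared-exponential kernel matrix."""
    d2 = (np.sum(X1**2, axis=1, keepdims=True)
          - 2.0 * X1 @ X2.T
          + np.sum(X2**2, axis=1))
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

X_grid = np.linspace(-5.0, 5.0, 200).reshape(-1, 1)
rng = np.random.default_rng(0)
for ell in (0.5, 2.0):
    # Small jitter keeps the covariance matrix numerically positive definite
    K = rbf_kernel(X_grid, X_grid, length_scale=ell) + 1e-6 * np.eye(len(X_grid))
    L = np.linalg.cholesky(K)
    samples = (L @ rng.standard_normal((len(X_grid), 3))).T  # each row is one sample function
    print(f"length_scale={ell}: drew {samples.shape[0]} prior samples over {samples.shape[1]} grid points")
Plotting the rows of samples for the two length scales makes the smooth-versus-wiggly contrast visible at a glance.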
Matérn Kernel
For real problems the RBF kernel is often “too smooth”. The Matérn kernel lets you control differentiability via the smoothness parameter \(\nu\) :
\[ k_{\text{Matern}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell}\right) \tag{3} \]where \(K_{\nu}\) is the modified Bessel function of the second kind. For \(\nu = 1/2\) it reduces to the exponential kernel; as \(\nu \to \infty\) it converges to the RBF kernel. In practice \(\nu = 3/2\) and \(\nu = 5/2\) are most common.
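For \(\nu = 5/2\) the Bessel function drops out and Eq. (3) reduces to the closed form \(\sigma_f^2 \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5}\,r}{\ell}\right)\) with \(r = \|\mathbf{x} - \mathbf{x}'\|\). The sketch below (the function name matern52 is an illustrative choice) could be swapped in for the RBF kernel of the regressor implemented later in this article.
import numpy as np

def matern52(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matern nu = 5/2 kernel: closed form, no Bessel function required."""
    d2 = (np.sum(X1**2, axis=1, keepdims=True)
          - 2.0 * X1 @ X2.T
          + np.sum(X2**2, axis=1))
    r = np.sqrt(np.maximum(d2, 0.0))       # pairwise Euclidean distances
    s = np.sqrt(5.0) * r / length_scale
    return signal_var * (1.0 + s + s**2 / 3.0) * np.exp(-s)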
Conditioning on Observations (Posterior)
The real power of a GP is that, given observations, the posterior is available in closed form.
For \(n\) noisy observations \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\) :
\[ y_i = f(\mathbf{x}_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) \tag{4} \]The joint distribution of the observed targets \(\mathbf{y}\) and a new function value \(f_*\) at \(\mathbf{x}_*\) is:
\[ \begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{k}_* \\ \mathbf{k}_*^T & k(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix} \right) \tag{5} \]where \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\) and \(\mathbf{k}_* = [k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*)]^T\) . Standard Gaussian conditioning yields the posterior predictive distribution in closed form:
\[ \mu_*(\mathbf{x}_*) = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} \tag{6} \] \[ \sigma_*^2(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_* \tag{7} \]Equation (6) is the predictive mean; (7) is the predictive variance. The variance shrinks where data is dense and grows toward the prior variance \(\sigma_f^2\) far from observations. This built-in uncertainty quantification is the core advantage GPs hold over point-estimate models such as neural networks.
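For reference, the "standard Gaussian conditioning" step above is the following identity for a zero-mean partitioned Gaussian; applying it to Eq. (5) with \(\mathbf{a} = \mathbf{y}\) and \(\mathbf{b} = f_*\) gives Eqs. (6) and (7) directly:
\[ \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{C}^T & \mathbf{B} \end{bmatrix}\right) \quad\Rightarrow\quad \mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left(\mathbf{C}^T \mathbf{A}^{-1} \mathbf{a},\; \mathbf{B} - \mathbf{C}^T \mathbf{A}^{-1} \mathbf{C}\right) \]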
Marginal Likelihood and Hyperparameters
The kernel hyperparameters \(\boldsymbol{\theta} = (\sigma_f^2, \ell, \sigma_n^2)\) are typically learned by maximizing the log marginal likelihood:
\[ \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} - \frac{1}{2} \log |\mathbf{K} + \sigma_n^2 \mathbf{I}| - \frac{n}{2} \log 2\pi \tag{8} \]The first term measures data fit, while the second term is a complexity penalty (Occam’s razor). This penalty is the reason GPs avoid overfitting in a Bayesian way.
Python Implementation: 1D GP Regression
We implement GP regression from scratch with NumPy and SciPy. Cholesky factorization is used for numerical stability instead of explicit matrix inversion.
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cho_factor, cho_solve
class GaussianProcessRegressor:
    def __init__(self, length_scale=1.0, signal_var=1.0, noise_var=1e-4):
        self.l = length_scale
        self.sf2 = signal_var
        self.sn2 = noise_var

    def rbf(self, X1, X2):
        """Eq. (2): RBF kernel"""
        d2 = (np.sum(X1**2, axis=1, keepdims=True)
              - 2.0 * X1 @ X2.T
              + np.sum(X2**2, axis=1))
        return self.sf2 * np.exp(-0.5 * d2 / self.l**2)

    def fit(self, X, y):
        self.X_train = X.copy()
        self.y_train = y.copy()
        K = self.rbf(X, X) + self.sn2 * np.eye(len(X))
        # Stable Cholesky factorization instead of explicit matrix inversion
        self.L_, self.lower_ = cho_factor(K, lower=True)
        self.alpha_ = cho_solve((self.L_, self.lower_), y)  # (K + sn2*I)^{-1} y

    def predict(self, X_new):
        """Eqs. (6), (7): posterior mean and variance"""
        k_star = self.rbf(self.X_train, X_new)
        mu = k_star.T @ self.alpha_
        v = cho_solve((self.L_, self.lower_), k_star)
        var = self.sf2 - np.sum(k_star * v, axis=0)
        var = np.maximum(var, 1e-10)  # guard against tiny negative values from round-off
        return mu, var

    def log_marginal_likelihood(self):
        """Eq. (8): log marginal likelihood"""
        n = len(self.y_train)
        log_det = 2.0 * np.sum(np.log(np.diag(self.L_)))  # log|K + sn2*I| from the Cholesky diagonal
        return (-0.5 * self.y_train @ self.alpha_
                - 0.5 * log_det
                - 0.5 * n * np.log(2 * np.pi))
Posterior Sampling and Prediction
We construct the posterior from noisy observations and visualize the 95% confidence band.
# Underlying true function
def true_f(x):
    return np.sin(2.0 * x) + 0.3 * x
np.random.seed(0)
X_train = np.array([[-3.0], [-2.0], [-0.5], [1.0], [2.5], [3.5]])
y_train = true_f(X_train.ravel()) + 0.1 * np.random.randn(len(X_train))
gp = GaussianProcessRegressor(length_scale=1.0, signal_var=1.0, noise_var=0.01)
gp.fit(X_train, y_train)
X_test = np.linspace(-5.0, 5.0, 300).reshape(-1, 1)
mu, var = gp.predict(X_test)
sigma = np.sqrt(var)
plt.figure(figsize=(10, 5))
plt.plot(X_test, true_f(X_test.ravel()), "k--", label="True function")
plt.plot(X_test, mu, "b-", label="GP mean")
plt.fill_between(X_test.ravel(), mu - 1.96 * sigma, mu + 1.96 * sigma,
alpha=0.2, color="blue", label="95% CI")
plt.scatter(X_train, y_train, c="red", s=60, zorder=5, label="Observations")
plt.xlabel("x"); plt.ylabel("f(x)")
plt.title("Gaussian Process Regression (RBF kernel)")
plt.legend(); plt.grid(True); plt.tight_layout()
plt.savefig("gp_regression.png", dpi=150)
print(f"log marginal likelihood: {gp.log_marginal_likelihood():.4f}")
The confidence band narrows around observations and widens away from data, which is the direct consequence of the first term \(k(\mathbf{x}_*, \mathbf{x}_*) = \sigma_f^2\) dominating Eq. (7) far from the training set.
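As a quick numeric check (the probe points 1.0 and 5.0 are arbitrary choices; gp is the model fitted above), the predictive standard deviation is small at a training input and close to \(\sigma_f = 1\) far outside the data range:
for x_probe in (1.0, 5.0):
    mu_p, var_p = gp.predict(np.array([[x_probe]]))
    print(f"x = {x_probe:+.1f}: mean = {mu_p[0]:+.3f}, std = {np.sqrt(var_p[0]):.3f}")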
Effect of Hyperparameters
The length scale \(\ell\) and noise variance \(\sigma_n^2\) shape GP behavior dramatically.
Length scale \(\ell\)
- Small \(\ell\) : rapid local variation, sharp interpolation between data; risk of overfitting.
- Large \(\ell\) : smooth approximation that does not pass through every point; risk of underfitting.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, l in zip(axes, [0.3, 1.0, 3.0]):
    gp = GaussianProcessRegressor(length_scale=l, signal_var=1.0, noise_var=0.01)
    gp.fit(X_train, y_train)
    mu, var = gp.predict(X_test)
    sigma = np.sqrt(var)
    ax.plot(X_test, true_f(X_test.ravel()), "k--")
    ax.plot(X_test, mu, "b-")
    ax.fill_between(X_test.ravel(), mu - 1.96*sigma, mu + 1.96*sigma,
                    alpha=0.2, color="blue")
    ax.scatter(X_train, y_train, c="red", s=40, zorder=5)
    ax.set_title(f"length_scale = {l}")
    ax.grid(True)
plt.tight_layout()
plt.savefig("gp_length_scale.png", dpi=150)
Noise variance \(\sigma_n^2\)
A small \(\sigma_n^2\) forces the posterior to interpolate the data exactly; a larger value allows the curve to deviate from observations and stay smooth. It also acts as a numerical jitter stabilizing the inversion of \(\mathbf{K} + \sigma_n^2 \mathbf{I}\) in Eq. (6) (typical jitter values: \(10^{-6}\) to \(10^{-4}\) ).
In practice, hyperparameters are tuned automatically by maximizing the log marginal likelihood (Eq. 8) with scipy.optimize.minimize.
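A minimal sketch of what that looks like with the class above: the hyperparameters are optimized in log space (an illustrative choice to keep them positive), gradients are left to finite differences, and in practice you may want bounds or multiple restarts because the marginal likelihood can be multimodal.
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    """Negative of Eq. (8) as a function of log(length_scale, signal_var, noise_var)."""
    length_scale, signal_var, noise_var = np.exp(log_theta)
    gp = GaussianProcessRegressor(length_scale, signal_var, noise_var)
    gp.fit(X, y)
    return -gp.log_marginal_likelihood()

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.01]),
               args=(X_train, y_train), method="L-BFGS-B")
l_opt, sf2_opt, sn2_opt = np.exp(res.x)
print(f"optimized: length_scale={l_opt:.3f}, signal_var={sf2_opt:.3f}, noise_var={sn2_opt:.4f}")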
Computational Cost and Approximations
The main weakness of a GP is the \(O(n^3)\) training cost (Cholesky factorization of the kernel matrix) and \(O(n^2)\) memory. Pure GP regression is generally limited to \(n \approx 10^4\) .
Common scalable approximations include:
- Inducing point methods (FITC, VFE, SVGP): \(m \ll n\) pseudo-inputs reduce cost to \(O(nm^2)\)
- Random Fourier Features: approximate the kernel with finite-dimensional random features (see the sketch after this list)
- Local approximations: predict using only nearby data points
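As an illustration of the random-feature idea for the RBF kernel, here is a minimal sketch; the function name rff_features and the feature count are illustrative choices, and the check at the end reuses X_train from above with \(\ell = 1\) and \(\sigma_f^2 = 1\).
def rff_features(X, n_features=500, length_scale=1.0, seed=0):
    """Random Fourier features whose inner products approximate the unit-variance RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / length_scale, size=(d, n_features))  # frequencies ~ N(0, I / l^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)              # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

Phi = rff_features(X_train)
K_approx = Phi @ Phi.T                                   # approximates K with sigma_f^2 = 1
K_exact = np.exp(-0.5 * (X_train - X_train.T) ** 2)      # exact RBF Gram matrix (1D, l = 1)
print("max abs kernel error:", np.max(np.abs(K_approx - K_exact)))
Once the kernel is replaced by such finite features, GP regression reduces to Bayesian linear regression in the feature space, which costs only \(O(n D^2)\) for \(D\) features.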
Connection to Bayesian Optimization
The GP posterior mean (Eq. 6) and variance (Eq. 7) serve as the surrogate model in Bayesian optimization. Acquisition functions such as Expected Improvement (EI) and UCB are functions of \(\mu_*\) and \(\sigma_*\) that trade off exploitation (a promising predictive mean) against exploration (high predictive uncertainty).
For example, for a minimization problem the confidence-bound acquisition takes the simple form:
\[ \alpha_{\text{UCB}}(\mathbf{x}) = \mu_*(\mathbf{x}) - \kappa\, \sigma_*(\mathbf{x}) \tag{9} \]which is minimized over candidates \(\mathbf{x}\) (for maximization, the sign of the \(\sigma_*\) term flips and the acquisition is maximized). In other words, the GP machinery you built above is the foundation of Bayesian optimization. See the next article for the full picture.
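As a tiny illustration with the model and test grid from above (\(\kappa = 2\) is an arbitrary choice), evaluating Eq. (9) on X_test and taking its minimizer picks the next point to query when minimizing the objective:
kappa = 2.0
mu, var = gp.predict(X_test)
acq = mu - kappa * np.sqrt(var)      # Eq. (9): lower confidence bound
x_next = X_test[np.argmin(acq)]      # most promising next query for minimization
print(f"next query point: x = {x_next[0]:.3f}")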
Related Articles
- Bayesian Optimization: Fundamentals and Python Implementation - Uses the GP from this article as a surrogate model and adds acquisition functions (EI/UCB/PI) for sample-efficient global optimization.
- Markov Chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs Sampling - Computational methods for Bayesian inference underlying GPs.
- From SGD to Adam: Evolution of Gradient-Based Optimization - Gradient-based optimizers whose hyperparameters are typical targets of GP-based Bayesian optimization.
- Support Vector Machine (SVM): Theory and Python Implementation - Another RBF-kernel method; contrast GP’s probabilistic predictions with SVM’s deterministic margin maximization.
- Particle Swarm Optimization (PSO): Theory and Python Implementation - Mass-evaluation swarm optimizer; clarifies when to choose GP-based Bayesian optimization vs PSO based on evaluation cost.
References
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press, Ch. 18.
- Snelson, E., & Ghahramani, Z. (2006). “Sparse Gaussian Processes using Pseudo-inputs.” NeurIPS 2006.
- Hensman, J., Fusi, N., & Lawrence, N. D. (2013). “Gaussian Processes for Big Data.” UAI 2013.