What is a Gaussian Process?
A Gaussian process (GP) is a probabilistic model that treats the function itself as a random variable. Intuitively, you can think of it as a multivariate normal distribution extended to infinite dimensions.
While an ordinary normal distribution defines a distribution over a vector \(\mathbf{y} \in \mathbb{R}^n\) , a GP defines a distribution over a function \(f: \mathcal{X} \rightarrow \mathbb{R}\) . The formal definition: \(f\) is a Gaussian process if, for any finite set of inputs \(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) , the corresponding vector of function values \([f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^T\) follows a multivariate Gaussian distribution.
A GP is fully specified by a mean function \(m(\mathbf{x})\) and a kernel (covariance) function \(k(\mathbf{x}, \mathbf{x}')\) :
\[ f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) \tag{1} \]In practice \(m(\mathbf{x}) = 0\) is a common choice; the function shape is controlled by the kernel.
Kernel Functions (Designing the Prior)
The kernel \(k(\mathbf{x}, \mathbf{x}')\) specifies the covariance of function values at two inputs and encodes prior beliefs about smoothness, periodicity, and so on.
RBF (Squared Exponential) Kernel
The most popular choice is the RBF kernel, also called the squared exponential kernel:
\[ k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 \ell^2}\right) \tag{2} \]
- \(\sigma_f^2\) : output variance (function amplitude)
- \(\ell\) : length scale, controlling how quickly correlation decays with distance
A larger \(\ell\) produces smoother functions; a smaller \(\ell\) allows rapid local variation. RBF samples are infinitely differentiable.
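To make the length-scale intuition concrete, here is a minimal sketch (NumPy only) that draws sample functions from a zero-mean GP prior with the RBF kernel of Eq. (2); the helper name rbf_kernel, the grid, and the jitter value are illustrative choices, not part of the implementation later in this article.
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Eq. (2): squared-exponential kernel matrix."""
    d2 = (np.sum(X1**2, axis=1, keepdims=True)
          - 2.0 * X1 @ X2.T
          + np.sum(X2**2, axis=1))
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

X_grid = np.linspace(-5.0, 5.0, 200).reshape(-1, 1)
rng = np.random.default_rng(0)
for ell in (0.5, 2.0):
    # Small jitter keeps the covariance matrix numerically positive definite
    K = rbf_kernel(X_grid, X_grid, length_scale=ell) + 1e-6 * np.eye(len(X_grid))
    L = np.linalg.cholesky(K)
    samples = (L @ rng.standard_normal((len(X_grid), 3))).T  # each row is one sample function
    print(f"length_scale={ell}: drew {samples.shape[0]} prior samples over {samples.shape[1]} grid points")
Plotting the rows of samples for the two length scales makes the smooth-versus-wiggly contrast visible at a glance.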
Matérn Kernel
For real problems the RBF kernel is often “too smooth”. The Matérn kernel lets you control differentiability via the smoothness parameter \(\nu\) :
\[ k_{\text{Matern}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell}\right) \tag{3} \]where \(K_{\nu}\) is the modified Bessel function of the second kind. For \(\nu = 1/2\) it reduces to the exponential kernel; as \(\nu \to \infty\) it converges to the RBF kernel. In practice \(\nu = 3/2\) and \(\nu = 5/2\) are most common.
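For \(\nu = 5/2\) the Bessel function drops out and Eq. (3) reduces to the closed form \(\sigma_f^2 \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5}\,r}{\ell}\right)\) with \(r = \|\mathbf{x} - \mathbf{x}'\|\). The sketch below (the function name matern52 is an illustrative choice) could be swapped in for the RBF kernel of the regressor implemented later in this article.
import numpy as np

def matern52(X1, X2, length_scale=1.0, signal_var=1.0):
    """Matern nu = 5/2 kernel: closed form, no Bessel function required."""
    d2 = (np.sum(X1**2, axis=1, keepdims=True)
          - 2.0 * X1 @ X2.T
          + np.sum(X2**2, axis=1))
    r = np.sqrt(np.maximum(d2, 0.0))       # pairwise Euclidean distances
    s = np.sqrt(5.0) * r / length_scale
    return signal_var * (1.0 + s + s**2 / 3.0) * np.exp(-s)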
Conditioning on Observations (Posterior)
The real power of a GP is that, given observations, the posterior is available in closed form.
For \(n\) noisy observations \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\) :
\[ y_i = f(\mathbf{x}_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) \tag{4} \]The joint distribution of the observed targets \(\mathbf{y}\) and a new function value \(f_*\) at \(\mathbf{x}_*\) is:
\[ \begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{k}_* \\ \mathbf{k}_*^T & k(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix} \right) \tag{5} \]where \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\) and \(\mathbf{k}_* = [k(\mathbf{x}_1, \mathbf{x}_*), \ldots, k(\mathbf{x}_n, \mathbf{x}_*)]^T\) . Standard Gaussian conditioning yields the posterior predictive distribution in closed form:
\[ \mu_*(\mathbf{x}_*) = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} \tag{6} \] \[ \sigma_*^2(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_* \tag{7} \]Equation (6) is the predictive mean; (7) is the predictive variance. The variance shrinks where data is dense and grows toward the prior variance \(\sigma_f^2\) far from observations. This built-in uncertainty quantification is the core advantage GPs hold over point-estimate models such as neural networks.
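For reference, the "standard Gaussian conditioning" step above is the following identity for a zero-mean partitioned Gaussian; applying it to Eq. (5) with \(\mathbf{a} = \mathbf{y}\) and \(\mathbf{b} = f_*\) gives Eqs. (6) and (7) directly:
\[ \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{C}^T & \mathbf{B} \end{bmatrix}\right) \quad\Rightarrow\quad \mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left(\mathbf{C}^T \mathbf{A}^{-1} \mathbf{a},\; \mathbf{B} - \mathbf{C}^T \mathbf{A}^{-1} \mathbf{C}\right) \]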
Marginal Likelihood and Hyperparameters
The kernel hyperparameters \(\boldsymbol{\theta} = (\sigma_f^2, \ell, \sigma_n^2)\) are typically learned by maximizing the log marginal likelihood:
\[ \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} - \frac{1}{2} \log |\mathbf{K} + \sigma_n^2 \mathbf{I}| - \frac{n}{2} \log 2\pi \tag{8} \]The first term measures data fit, while the second term is a complexity penalty (Occam’s razor). This penalty is the reason GPs avoid overfitting in a Bayesian way.
Python Implementation: 1D GP Regression
We implement GP regression from scratch with NumPy and SciPy. Cholesky factorization is used for numerical stability instead of explicit matrix inversion.
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cho_factor, cho_solve
class GaussianProcessRegressor:
    def __init__(self, length_scale=1.0, signal_var=1.0, noise_var=1e-4):
        self.l = length_scale
        self.sf2 = signal_var
        self.sn2 = noise_var

    def rbf(self, X1, X2):
        """Eq. (2): RBF kernel"""
        d2 = (np.sum(X1**2, axis=1, keepdims=True)
              - 2.0 * X1 @ X2.T
              + np.sum(X2**2, axis=1))
        return self.sf2 * np.exp(-0.5 * d2 / self.l**2)

    def fit(self, X, y):
        self.X_train = X.copy()
        self.y_train = y.copy()
        K = self.rbf(X, X) + self.sn2 * np.eye(len(X))
        # Stable Cholesky factorization instead of explicit matrix inversion
        self.L_, self.lower_ = cho_factor(K, lower=True)
        self.alpha_ = cho_solve((self.L_, self.lower_), y)  # (K + sn2*I)^{-1} y

    def predict(self, X_new):
        """Eqs. (6), (7): posterior mean and variance"""
        k_star = self.rbf(self.X_train, X_new)
        mu = k_star.T @ self.alpha_
        v = cho_solve((self.L_, self.lower_), k_star)
        var = self.sf2 - np.sum(k_star * v, axis=0)
        var = np.maximum(var, 1e-10)  # guard against tiny negative values from round-off
        return mu, var

    def log_marginal_likelihood(self):
        """Eq. (8): log marginal likelihood"""
        n = len(self.y_train)
        log_det = 2.0 * np.sum(np.log(np.diag(self.L_)))  # log|K + sn2*I| from the Cholesky diagonal
        return (-0.5 * self.y_train @ self.alpha_
                - 0.5 * log_det
                - 0.5 * n * np.log(2 * np.pi))
Posterior Sampling and Prediction
We construct the posterior from noisy observations and visualize the 95% confidence band.
# Underlying true function
def true_f(x):
    return np.sin(2.0 * x) + 0.3 * x
np.random.seed(0)
X_train = np.array([[-3.0], [-2.0], [-0.5], [1.0], [2.5], [3.5]])
y_train = true_f(X_train.ravel()) + 0.1 * np.random.randn(len(X_train))
gp = GaussianProcessRegressor(length_scale=1.0, signal_var=1.0, noise_var=0.01)
gp.fit(X_train, y_train)
X_test = np.linspace(-5.0, 5.0, 300).reshape(-1, 1)
mu, var = gp.predict(X_test)
sigma = np.sqrt(var)
plt.figure(figsize=(10, 5))
plt.plot(X_test, true_f(X_test.ravel()), "k--", label="True function")
plt.plot(X_test, mu, "b-", label="GP mean")
plt.fill_between(X_test.ravel(), mu - 1.96 * sigma, mu + 1.96 * sigma,
alpha=0.2, color="blue", label="95% CI")
plt.scatter(X_train, y_train, c="red", s=60, zorder=5, label="Observations")
plt.xlabel("x"); plt.ylabel("f(x)")
plt.title("Gaussian Process Regression (RBF kernel)")
plt.legend(); plt.grid(True); plt.tight_layout()
plt.savefig("gp_regression.png", dpi=150)
print(f"log marginal likelihood: {gp.log_marginal_likelihood():.4f}")
The confidence band narrows around observations and widens away from data, which is the direct consequence of the first term \(k(\mathbf{x}_*, \mathbf{x}_*) = \sigma_f^2\) dominating Eq. (7) far from the training set.
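As a quick numeric check (the probe points 1.0 and 5.0 are arbitrary choices; gp is the model fitted above), the predictive standard deviation is small at a training input and close to \(\sigma_f = 1\) far outside the data range:
for x_probe in (1.0, 5.0):
    mu_p, var_p = gp.predict(np.array([[x_probe]]))
    print(f"x = {x_probe:+.1f}: mean = {mu_p[0]:+.3f}, std = {np.sqrt(var_p[0]):.3f}")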
Effect of Hyperparameters
The length scale \(\ell\) and noise variance \(\sigma_n^2\) shape GP behavior dramatically.
Length scale \(\ell\)
- Small \(\ell\) : rapid local variation, sharp interpolation between data; risk of overfitting.
- Large \(\ell\) : smooth approximation that does not pass through every point; risk of underfitting.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, l in zip(axes, [0.3, 1.0, 3.0]):
    gp = GaussianProcessRegressor(length_scale=l, signal_var=1.0, noise_var=0.01)
    gp.fit(X_train, y_train)
    mu, var = gp.predict(X_test)
    sigma = np.sqrt(var)
    ax.plot(X_test, true_f(X_test.ravel()), "k--")
    ax.plot(X_test, mu, "b-")
    ax.fill_between(X_test.ravel(), mu - 1.96*sigma, mu + 1.96*sigma,
                    alpha=0.2, color="blue")
    ax.scatter(X_train, y_train, c="red", s=40, zorder=5)
    ax.set_title(f"length_scale = {l}")
    ax.grid(True)
plt.tight_layout()
plt.savefig("gp_length_scale.png", dpi=150)
Noise variance \(\sigma_n^2\)
A small \(\sigma_n^2\) forces the posterior to interpolate the data exactly; a larger value allows the curve to deviate from observations and stay smooth. It also acts as a numerical jitter stabilizing the inversion of \(\mathbf{K} + \sigma_n^2 \mathbf{I}\) in Eq. (6) (typical jitter values: \(10^{-6}\) to \(10^{-4}\) ).
In practice, hyperparameters are tuned automatically by maximizing the log marginal likelihood (Eq. 8) with scipy.optimize.minimize.
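A minimal sketch of what that looks like with the class above: the hyperparameters are optimized in log space (an illustrative choice to keep them positive), gradients are left to finite differences, and in practice you may want bounds or multiple restarts because the marginal likelihood can be multimodal.
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    """Negative of Eq. (8) as a function of log(length_scale, signal_var, noise_var)."""
    length_scale, signal_var, noise_var = np.exp(log_theta)
    gp = GaussianProcessRegressor(length_scale, signal_var, noise_var)
    gp.fit(X, y)
    return -gp.log_marginal_likelihood()

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.01]),
               args=(X_train, y_train), method="L-BFGS-B")
l_opt, sf2_opt, sn2_opt = np.exp(res.x)
print(f"optimized: length_scale={l_opt:.3f}, signal_var={sf2_opt:.3f}, noise_var={sn2_opt:.4f}")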
Computational Cost and Approximations
The main weakness of a GP is the \(O(n^3)\) training cost (Cholesky factorization of the kernel matrix) and \(O(n^2)\) memory. Pure GP regression is generally limited to \(n \approx 10^4\) .
Common scalable approximations include:
- Inducing point methods (FITC, VFE, SVGP): \(m \ll n\) pseudo-inputs reduce cost to \(O(nm^2)\)
- Random Fourier Features: approximate the kernel with finite-dimensional random features (see the sketch after this list)
- Local approximations: predict using only nearby data points
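As an illustration of the random-feature idea for the RBF kernel, here is a minimal sketch; the function name rff_features and the feature count are illustrative choices, and the check at the end reuses X_train from above with \(\ell = 1\) and \(\sigma_f^2 = 1\).
def rff_features(X, n_features=500, length_scale=1.0, seed=0):
    """Random Fourier features whose inner products approximate the unit-variance RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / length_scale, size=(d, n_features))  # frequencies ~ N(0, I / l^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)              # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

Phi = rff_features(X_train)
K_approx = Phi @ Phi.T                                   # approximates K with sigma_f^2 = 1
K_exact = np.exp(-0.5 * (X_train - X_train.T) ** 2)      # exact RBF Gram matrix (1D, l = 1)
print("max abs kernel error:", np.max(np.abs(K_approx - K_exact)))
Once the kernel is replaced by such finite features, GP regression reduces to Bayesian linear regression in the feature space, which costs only \(O(n D^2)\) for \(D\) features.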
Connection to Bayesian Optimization
The GP posterior mean (Eq. 6) and variance (Eq. 7) serve as the surrogate model in Bayesian optimization. Acquisition functions such as Expected Improvement (EI) and UCB are functions of \(\mu_*\) and \(\sigma_*\) that trade off exploitation (a promising predictive mean) against exploration (high predictive uncertainty).
For example, for a minimization problem the confidence-bound acquisition takes the simple form:
\[ \alpha_{\text{UCB}}(\mathbf{x}) = \mu_*(\mathbf{x}) - \kappa\, \sigma_*(\mathbf{x}) \tag{9} \]which is minimized over candidates \(\mathbf{x}\) (for maximization, the sign of the \(\sigma_*\) term flips and the acquisition is maximized). In other words, the GP machinery you built above is the foundation of Bayesian optimization. See the next article for the full picture.
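As a tiny illustration with the model and test grid from above (\(\kappa = 2\) is an arbitrary choice), evaluating Eq. (9) on X_test and taking its minimizer picks the next point to query when minimizing the objective:
kappa = 2.0
mu, var = gp.predict(X_test)
acq = mu - kappa * np.sqrt(var)      # Eq. (9): lower confidence bound
x_next = X_test[np.argmin(acq)]      # most promising next query for minimization
print(f"next query point: x = {x_next[0]:.3f}")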
Related Articles
- Bayesian Optimization: Fundamentals and Python Implementation - Uses the GP from this article as a surrogate model and adds acquisition functions (EI/UCB/PI) for sample-efficient global optimization.
- Markov Chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs Sampling - Computational methods for Bayesian inference underlying GPs.
- From SGD to Adam: Evolution of Gradient-Based Optimization - Gradient-based optimizers whose hyperparameters are typical targets of GP-based Bayesian optimization.
- Support Vector Machine (SVM): Theory and Python Implementation - Another RBF-kernel method; contrast GP’s probabilistic predictions with SVM’s deterministic margin maximization.
- Particle Swarm Optimization (PSO): Theory and Python Implementation - Mass-evaluation swarm optimizer; clarifies when to choose GP-based Bayesian optimization vs PSO based on evaluation cost.
References
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press, Ch. 18.
- Snelson, E., & Ghahramani, Z. (2006). “Sparse Gaussian Processes using Pseudo-inputs.” NeurIPS 2006.
- Hensman, J., Fusi, N., & Lawrence, N. D. (2013). “Gaussian Processes for Big Data.” UAI 2013.