Markov Chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs Sampling

Explains the fundamentals of MCMC methods including Metropolis-Hastings and Gibbs sampling, with Python implementation for Bayesian inference.

Introduction

In Bayesian inference, sampling from the posterior distribution \(p(\theta | D)\) is a central challenge. In most cases, this distribution is analytically intractable and cannot be directly sampled.

Markov Chain Monte Carlo (MCMC) constructs a Markov chain whose stationary distribution is the target distribution, generating samples from it. Running the chain long enough yields approximate samples from the posterior.

Markov Chain Basics

A Markov chain is a stochastic process where the next state depends only on the current state:

\[P(X_{t+1} | X_1, \ldots, X_t) = P(X_{t+1} | X_t) \tag{1}\]

The goal of MCMC is to design a Markov chain with target distribution \(\pi(\theta)\) as its stationary distribution, guaranteed by the detailed balance condition:

\[\pi(\theta) T(\theta' | \theta) = \pi(\theta') T(\theta | \theta') \tag{2}\]

Metropolis-Hastings Algorithm

Algorithm

  1. Set initial value \(\theta_0\)
  2. Generate candidate \(\theta'\) from proposal distribution \(q(\theta' | \theta_t)\)
  3. Compute acceptance probability:
\[\alpha = \min\left(1, \frac{\pi(\theta') q(\theta_t | \theta')}{\pi(\theta_t) q(\theta' | \theta_t)}\right) \tag{3}\]
  1. Accept \(\theta_{t+1} = \theta'\) with probability \(\alpha\); otherwise \(\theta_{t+1} = \theta_t\)
  2. Repeat steps 2-4

For symmetric proposals (\(q(\theta' | \theta) = q(\theta | \theta')\)), this simplifies to the Metropolis algorithm:

\[\alpha = \min\left(1, \frac{\pi(\theta')}{\pi(\theta_t)}\right) \tag{4}\]

Crucially, the normalization constant of \(\pi(\theta)\) is not needed. In Bayesian inference, only the unnormalized density \(p(\theta | D) \propto p(D | \theta) p(\theta)\) is required.

Python Implementation

Sampling from a mixture of Gaussians:

import numpy as np
import matplotlib.pyplot as plt

def metropolis_hastings(log_target, initial, n_samples, proposal_std=1.0):
    """Metropolis-Hastings algorithm"""
    samples = [initial]
    current = initial
    accepted = 0

    for _ in range(n_samples):
        proposal = current + np.random.normal(0, proposal_std)
        log_alpha = log_target(proposal) - log_target(current)

        if np.log(np.random.random()) < log_alpha:
            current = proposal
            accepted += 1

        samples.append(current)

    acceptance_rate = accepted / n_samples
    return np.array(samples), acceptance_rate

# --- Target: Gaussian mixture ---
def log_target(x):
    """Log target density: 0.3*N(-2,1) + 0.7*N(3,0.5)"""
    from scipy.special import logsumexp
    log_p1 = np.log(0.3) - 0.5 * (x + 2)**2
    log_p2 = np.log(0.7) - (x - 3)**2
    return logsumexp([log_p1, log_p2])

# --- Run ---
np.random.seed(42)
samples, acc_rate = metropolis_hastings(log_target, initial=0.0,
                                        n_samples=50000, proposal_std=1.5)
print(f"Acceptance rate: {acc_rate:.3f}")

burn_in = 5000
samples = samples[burn_in:]

# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(samples[:2000], alpha=0.5, linewidth=0.5)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Sample value')
axes[0].set_title('Trace Plot')
axes[0].grid(True, alpha=0.3)

x = np.linspace(-6, 6, 200)
true_pdf = 0.3 * np.exp(-0.5 * (x + 2)**2) / np.sqrt(2*np.pi) + \
           0.7 * np.exp(-0.5 * ((x - 3)/0.5)**2) / (0.5 * np.sqrt(2*np.pi))
axes[1].hist(samples, bins=100, density=True, alpha=0.5, label='MCMC samples')
axes[1].plot(x, true_pdf, 'r-', linewidth=2, label='True density')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Density')
axes[1].set_title('Histogram vs True Distribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Gibbs Sampling

For multidimensional cases, sample each variable sequentially from its conditional distribution. For two variables \((\theta_1, \theta_2)\):

  1. \(\theta_1^{(t+1)} \sim p(\theta_1 | \theta_2^{(t)}, D)\)
  2. \(\theta_2^{(t+1)} \sim p(\theta_2 | \theta_1^{(t+1)}, D)\)

Particularly effective when conditional distributions are known (e.g., conjugate priors). It is a special case of Metropolis-Hastings with acceptance rate always 1.

Practical Considerations

Burn-in

Discard the first several thousand samples (burn-in period) to remove the influence of the initial value.

Acceptance Rate Tuning

The optimal acceptance rate is approximately 23.4% for high-dimensional problems. Too large a proposal standard deviation lowers acceptance; too small slows exploration.

Autocorrelation and Thinning

Consecutive samples are correlated. When independent samples are needed, apply thinning (keeping every \(k\)-th sample).

References

  • Metropolis, N., et al. (1953). “Equation of State Calculations by Fast Computing Machines”. The Journal of Chemical Physics, 21(6), 1087-1092.
  • Hastings, W. K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. Biometrika, 57(1), 97-109.
  • Gelman, A., et al. (2013). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC. Chapters 11-12.