Introduction
In Bayesian inference, sampling from the posterior distribution \(p(\theta | D)\) is a central challenge. In most cases, this distribution is analytically intractable and cannot be directly sampled.
Markov Chain Monte Carlo (MCMC) constructs a Markov chain whose stationary distribution is the target distribution, generating samples from it. Running the chain long enough yields approximate samples from the posterior.
Markov Chain Basics
A Markov chain is a stochastic process where the next state depends only on the current state:
\[P(X_{t+1} | X_1, \ldots, X_t) = P(X_{t+1} | X_t) \tag{1}\]The goal of MCMC is to design a Markov chain with target distribution \(\pi(\theta)\) as its stationary distribution, guaranteed by the detailed balance condition:
\[\pi(\theta) T(\theta' | \theta) = \pi(\theta') T(\theta | \theta') \tag{2}\]Metropolis-Hastings Algorithm
Algorithm
- Set initial value \(\theta_0\)
- Generate candidate \(\theta'\) from proposal distribution \(q(\theta' | \theta_t)\)
- Compute acceptance probability:
- Accept \(\theta_{t+1} = \theta'\) with probability \(\alpha\); otherwise \(\theta_{t+1} = \theta_t\)
- Repeat steps 2-4
For symmetric proposals (\(q(\theta' | \theta) = q(\theta | \theta')\)), this simplifies to the Metropolis algorithm:
\[\alpha = \min\left(1, \frac{\pi(\theta')}{\pi(\theta_t)}\right) \tag{4}\]Crucially, the normalization constant of \(\pi(\theta)\) is not needed. In Bayesian inference, only the unnormalized density \(p(\theta | D) \propto p(D | \theta) p(\theta)\) is required.
Python Implementation
Sampling from a mixture of Gaussians:
import numpy as np
import matplotlib.pyplot as plt
def metropolis_hastings(log_target, initial, n_samples, proposal_std=1.0):
"""Metropolis-Hastings algorithm"""
samples = [initial]
current = initial
accepted = 0
for _ in range(n_samples):
proposal = current + np.random.normal(0, proposal_std)
log_alpha = log_target(proposal) - log_target(current)
if np.log(np.random.random()) < log_alpha:
current = proposal
accepted += 1
samples.append(current)
acceptance_rate = accepted / n_samples
return np.array(samples), acceptance_rate
# --- Target: Gaussian mixture ---
def log_target(x):
"""Log target density: 0.3*N(-2,1) + 0.7*N(3,0.5)"""
from scipy.special import logsumexp
log_p1 = np.log(0.3) - 0.5 * (x + 2)**2
log_p2 = np.log(0.7) - (x - 3)**2
return logsumexp([log_p1, log_p2])
# --- Run ---
np.random.seed(42)
samples, acc_rate = metropolis_hastings(log_target, initial=0.0,
n_samples=50000, proposal_std=1.5)
print(f"Acceptance rate: {acc_rate:.3f}")
burn_in = 5000
samples = samples[burn_in:]
# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(samples[:2000], alpha=0.5, linewidth=0.5)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Sample value')
axes[0].set_title('Trace Plot')
axes[0].grid(True, alpha=0.3)
x = np.linspace(-6, 6, 200)
true_pdf = 0.3 * np.exp(-0.5 * (x + 2)**2) / np.sqrt(2*np.pi) + \
0.7 * np.exp(-0.5 * ((x - 3)/0.5)**2) / (0.5 * np.sqrt(2*np.pi))
axes[1].hist(samples, bins=100, density=True, alpha=0.5, label='MCMC samples')
axes[1].plot(x, true_pdf, 'r-', linewidth=2, label='True density')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Density')
axes[1].set_title('Histogram vs True Distribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Gibbs Sampling
For multidimensional cases, sample each variable sequentially from its conditional distribution. For two variables \((\theta_1, \theta_2)\):
- \(\theta_1^{(t+1)} \sim p(\theta_1 | \theta_2^{(t)}, D)\)
- \(\theta_2^{(t+1)} \sim p(\theta_2 | \theta_1^{(t+1)}, D)\)
Particularly effective when conditional distributions are known (e.g., conjugate priors). It is a special case of Metropolis-Hastings with acceptance rate always 1.
Practical Considerations
Burn-in
Discard the first several thousand samples (burn-in period) to remove the influence of the initial value.
Acceptance Rate Tuning
The optimal acceptance rate is approximately 23.4% for high-dimensional problems. Too large a proposal standard deviation lowers acceptance; too small slows exploration.
Autocorrelation and Thinning
Consecutive samples are correlated. When independent samples are needed, apply thinning (keeping every \(k\)-th sample).
Related Articles
- Bayesian Linear Regression Fundamentals - Analytical Bayesian inference without MCMC.
- Cross-Entropy Method: A Practical Monte Carlo Optimization Technique - Monte Carlo methods applied to optimization.
- Particle Filter Python Implementation - Sequential Monte Carlo (SMC), closely related to MCMC.
- Bayesian Optimization: Fundamentals and Python Implementation - Using MCMC-estimated posteriors in Bayesian optimization.
References
- Metropolis, N., et al. (1953). “Equation of State Calculations by Fast Computing Machines”. The Journal of Chemical Physics, 21(6), 1087-1092.
- Hastings, W. K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. Biometrika, 57(1), 97-109.
- Gelman, A., et al. (2013). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC. Chapters 11-12.