Markov Chain Monte Carlo (MCMC)

An introduction to MCMC methods, covering Markov chains, stationarity, the Metropolis-Hastings algorithm, and how to sample from complex distributions.

Markov Property and Markov Chains

Consider a sequence of random variables \(x^{(1)}, x^{(2)}, x^{(3)}, \dots\) at discrete time steps \(t=1, 2, 3, \dots\).

  • Markov Chain: A stochastic process in which the sequence of random variables \(x^{(1)}, x^{(2)}, x^{(3)}, \dots\) is generated so that each state satisfies the Markov property below.
  • Markov Property: The property that the state at a given time depends only on the immediately preceding state (or a limited number of past states), rather than on the entire history.
    • First-order Markov property: The state \(x^{(t)}\) depends only on \(x^{(t-1)}\).
    • \(n\)-th order Markov property: The state \(x^{(t)}\) depends only on \(x^{(t-1)}, \dots, x^{(t-n)}\).
  • Transition probability \(p(x^{(t)}|x^{(t-1)})\): The probability of transitioning from state \(x^{(t-1)}\) at time \(t-1\) to state \(x^{(t)}\) at time \(t\).
  • Stationary distribution \(\pi(x)\): The probability distribution over states that a Markov chain converges to after running for a sufficiently long time. The stationary distribution satisfies the following balance equation: \( \pi(y) = \sum_x p(y|x)\pi(x) \)
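The definitions above can be made concrete with a small finite-state chain. The sketch below uses a hypothetical 3-state transition matrix (the specific probabilities are made up for illustration): running the chain forward from an arbitrary initial distribution yields a distribution that stops changing, and that fixed point satisfies the balance equation \( \pi(y) = \sum_x p(y|x)\pi(x) \).

```python
import numpy as np

# Hypothetical 3-state Markov chain; entry P[x, y] is the transition
# probability p(y|x) from state x to state y, so each row sums to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Start from an arbitrary initial distribution and run the chain forward.
# After sufficiently many steps the distribution converges to pi.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ P  # (pi @ P)[y] = sum_x p(y|x) * pi(x)

# The limit satisfies the balance equation pi = pi @ P.
assert np.allclose(pi, pi @ P)
print(pi)
```

Note that `pi @ P` is exactly the right-hand side of the balance equation written in matrix form, so the assertion checks stationarity directly.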

Markov Chain Monte Carlo (MCMC)

MCMC is a method for generating samples from complex probability distributions that are difficult to sample from directly. It works by designing transition probabilities so that the stationary distribution of a Markov chain matches the target distribution, and then simulating that Markov chain to generate samples.

Among the sequence of samples \(x^{(1)}, x^{(2)}, \dots\) generated by MCMC, the initial samples are strongly influenced by the starting value of the Markov chain. These are discarded as the burn-in period. Samples from after a sufficiently long burn-in period (\(x^{(t)}, x^{(t+1)}, \dots\) for sufficiently large \(t\)) are used for inference.
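The effect of the burn-in period can be seen even without a full MCMC sampler. The sketch below runs a simple Markov chain (an AR(1)-style random walk, chosen here only for illustration) deliberately started far from its stationary distribution: the early samples are biased toward the starting value and are discarded, while the later samples reflect the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple Markov chain: x(t) = 0.9 * x(t-1) + Gaussian noise.
# Its stationary distribution is a zero-mean Gaussian, but we
# deliberately initialize the chain far away, at x = 100.
x = 100.0
samples = []
for t in range(5000):
    x = 0.9 * x + rng.normal()
    samples.append(x)

burn_in = 500               # discard the initialization-biased samples
kept = samples[burn_in:]

print(np.mean(samples[:100]))  # early samples: pulled toward the start value
print(np.mean(kept))           # post-burn-in samples: near the stationary mean 0
```

The cutoff of 500 steps is an assumption for this toy chain; in practice the burn-in length is chosen by inspecting trace plots or convergence diagnostics.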

One of the most widely used MCMC algorithms is the Metropolis-Hastings algorithm.

Metropolis-Hastings Algorithm

Suppose we want to obtain samples from a target distribution \(p(x)\).

  1. Set an initial state \(x^{(1)}\) randomly.
  2. Repeat the following steps for \(t=1, 2, \dots\):
    a. From the current state \(x^{(t)}\), sample a candidate state \(\hat{y}\) from the proposal distribution \(q(\hat{y}|x^{(t)})\).
    b. Compute the acceptance probability \( \alpha(x^{(t)}, \hat{y}) = \min\left(1, \frac{p(\hat{y})\,q(x^{(t)}|\hat{y})}{p(x^{(t)})\,q(\hat{y}|x^{(t)})}\right) \), where \(p(x)\) is the target distribution.
    c. Generate a uniform random number \(u\) from \([0, 1]\). If \(u < \alpha(x^{(t)}, \hat{y})\), accept the candidate and set \(x^{(t+1)} = \hat{y}\); otherwise set \(x^{(t+1)} = x^{(t)}\) and remain in the current state.

By repeating this process, the distribution of the generated samples converges to the target distribution \(p(x)\). Note that \(p(x)\) only appears in a ratio in step 2b, so it suffices to know the target distribution up to a normalizing constant.
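The algorithm above can be sketched as follows. This minimal implementation uses a Gaussian random-walk proposal, which is symmetric (\(q(\hat{y}|x) = q(x|\hat{y})\)), so the \(q\) terms in the acceptance ratio cancel; the target here is an unnormalized two-Gaussian mixture chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # Target density known only up to a constant: a mixture of two
    # Gaussians centered at -2 and +2 (an illustrative choice).
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_steps, step_size=1.0):
    x = 0.0                          # step 1: arbitrary initial state
    samples = []
    for _ in range(n_steps):
        # step 2a: propose y_hat ~ q(.|x); a Gaussian centered at x
        # is symmetric, so q cancels from the acceptance ratio.
        y = x + step_size * rng.normal()
        # step 2b: acceptance probability alpha = min(1, p(y)/p(x))
        alpha = min(1.0, p_unnorm(y) / p_unnorm(x))
        # step 2c: accept with probability alpha, otherwise stay put
        if rng.uniform() < alpha:
            x = y
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(50_000)
kept = samples[1000:]                # discard the burn-in period
print(kept.mean())                   # target is symmetric about 0
```

Because the target mixture is symmetric about 0 with modes at \(\pm 2\), the post-burn-in sample mean should be near 0 and the sample standard deviation near \(\sqrt{5} \approx 2.24\), which gives a quick sanity check on the sampler.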

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)