Markov Chain Monte Carlo (MCMC)

An introduction to MCMC methods, covering Markov chains, stationarity, the Metropolis-Hastings algorithm, and how to sample from complex distributions.

Markov Property and Markov Chains

Consider a sequence of random variables $x^{(1)}, x^{(2)}, x^{(3)}, \dots$ at discrete time steps $t=1, 2, 3, \dots$.

  • Markov Chain: A sequence of random variables $x^{(1)}, x^{(2)}, x^{(3)}, \dots$ in which each state is generated from the preceding state(s) according to a fixed transition rule satisfying the Markov property.
  • Markov Property: The property that the state at a given time depends only on the immediately preceding state (or a limited number of past states), rather than on the entire history.
    • First-order Markov property: The state $x^{(t)}$ depends only on $x^{(t-1)}$.
    • $n$-th order Markov property: The state $x^{(t)}$ depends only on $x^{(t-1)}, \dots, x^{(t-n)}$.
  • Transition probability $p(x^{(t)}|x^{(t-1)})$: The probability of transitioning from state $x^{(t-1)}$ at time $t-1$ to state $x^{(t)}$ at time $t$.
  • Stationary distribution $\pi(x)$: The probability distribution over states that a Markov chain converges to after running for a sufficiently long time (under suitable conditions, e.g. irreducibility and aperiodicity). The stationary distribution satisfies the following balance equation; a small numerical check follows this list: $$ \pi(y) = \sum_x p(y|x)\pi(x) $$
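
To make the balance equation concrete, here is a minimal numerical sketch (the two-state transition matrix below is an arbitrary illustrative choice, not from the reference). Repeatedly applying the transition kernel to any initial distribution drives it toward $\pi$, which then satisfies $\pi(y) = \sum_x p(y|x)\pi(x)$:

```python
import numpy as np

# Hypothetical two-state chain; T[x, y] = p(y | x), each row sums to 1.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# "Run the chain for a long time": repeatedly apply the kernel to an
# arbitrary initial distribution; it converges to the stationary distribution.
pi = np.array([1.0, 0.0])   # start with all probability mass on state 0
for _ in range(100):
    pi = pi @ T

print(pi)                        # approx. [0.8333, 0.1667]
print(np.allclose(pi @ T, pi))   # balance equation pi(y) = sum_x p(y|x) pi(x) -> True
```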

Markov Chain Monte Carlo (MCMC)

MCMC is a method for generating samples from complex probability distributions that are difficult to sample from directly. It works by designing transition probabilities so that the stationary distribution of a Markov chain matches the target distribution, and then simulating that Markov chain to generate samples.
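
As a toy illustration of this design principle (a sketch; the three-state target distribution and the uniform proposal are arbitrary assumptions), the following builds a transition kernel whose stationary distribution is exactly a prescribed target. The construction used here, accepting or rejecting proposed moves with probability $\min(1, p(y)/p(x))$, is a special case of the Metropolis-Hastings algorithm described below:

```python
import numpy as np

# Hypothetical 3-state target distribution (any strictly positive probabilities work).
p = np.array([0.2, 0.3, 0.5])
n = len(p)
q = 1.0 / n   # symmetric proposal: pick one of the n states uniformly at random

# Transition kernel: T[x, y] = q(y|x) * min(1, p(y)/p(x)) for y != x,
# with all rejected probability mass placed back on the diagonal.
T = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            T[x, y] = q * min(1.0, p[y] / p[x])
    T[x, x] = 1.0 - T[x].sum()

# The balance equation pi(y) = sum_x p(y|x) pi(x) holds with pi = p:
print(p @ T)                    # reproduces [0.2, 0.3, 0.5]
print(np.allclose(p @ T, p))    # True
```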

Among the sequence of samples $x^{(1)}, x^{(2)}, \dots$ generated by MCMC, the initial samples are strongly influenced by the starting value of the Markov chain, so they are discarded as the burn-in period. Only samples obtained after a sufficiently long burn-in ($x^{(t)}, x^{(t+1)}, \dots$ for sufficiently large $t$) are used for inference.

One of the most widely used MCMC algorithms is the Metropolis-Hastings algorithm.

Metropolis-Hastings Algorithm

Suppose we want to obtain samples from a target distribution $p(x)$.

  1. Set an initial state $x^{(1)}$ randomly.
  2. Repeat the following steps for $t=1, 2, \dots$:
     a. From the current state $x^{(t)}$, sample a candidate state $\hat{y}$ from the proposal distribution $q(\hat{y}|x^{(t)})$.
     b. Compute the acceptance probability $\alpha(x^{(t)}, \hat{y})$, where $p(x)$ is the target distribution: $$ \alpha(x^{(t)}, \hat{y}) = \min\left(1, \frac{p(\hat{y})q(x^{(t)}|\hat{y})}{p(x^{(t)})q(\hat{y}|x^{(t)})}\right) $$
     c. Generate a uniform random number $u$ from $[0, 1]$. If $u < \alpha(x^{(t)}, \hat{y})$, accept the candidate and set $x^{(t+1)} = \hat{y}$; otherwise, set $x^{(t+1)} = x^{(t)}$ and remain in the current state.

By repeating this process, the distribution of the generated samples converges to the target distribution $p(x)$.
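
Below is a minimal Python sketch of random-walk Metropolis-Hastings (the bimodal unnormalized target, the Gaussian step size, and the burn-in length are illustrative assumptions, not prescribed by the algorithm). Because the Gaussian random-walk proposal is symmetric, the $q$ terms in the acceptance ratio cancel; note also that $p(x)$ only needs to be known up to a normalizing constant, since that constant cancels in the ratio as well.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized target density (hypothetical two-component Gaussian mixture).
    Metropolis-Hastings only needs the target up to a normalizing constant."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    """Random-walk Metropolis-Hastings with proposal q(y|x) = Normal(y; x, step^2).
    The proposal is symmetric, so q(x|y)/q(y|x) = 1 and the acceptance
    probability reduces to min(1, p(y)/p(x))."""
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + step * rng.standard_normal()       # a. propose a candidate state
        alpha = min(1.0, p_tilde(y) / p_tilde(x))  # b. acceptance probability
        if rng.uniform() < alpha:                  # c. accept with probability alpha,
            x = y                                  #    otherwise keep the current state
        chain[t] = x
    return chain

chain = metropolis_hastings(50_000, x0=10.0)  # deliberately poor starting value
samples = chain[5_000:]                       # discard the burn-in period
print(samples.mean(), samples.std())          # Monte Carlo estimates from the retained samples
```

The step size of the proposal controls a trade-off: very small steps are almost always accepted but the chain explores the target slowly, while very large steps are rejected most of the time; in practice it is tuned to balance acceptance rate against mixing speed.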

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)