Markov Chain Monte Carlo (MCMC)

An introduction to MCMC methods, covering Markov chains, stationarity, the Metropolis-Hastings algorithm, and how to sample from complex distributions.

Markov Property and Markov Chains

Consider a sequence of random variables $x^{(1)}, x^{(2)}, x^{(3)}, \dots$ at discrete time steps $t=1, 2, 3, \dots$.

  • Markov Chain: A sequence of random variables $x^{(1)}, x^{(2)}, x^{(3)}, \dots$ in which each state is generated from the preceding state(s) according to a fixed transition rule satisfying the Markov property.
  • Markov Property: The property that the state at a given time depends only on the immediately preceding state (or a limited number of past states), rather than on the entire history.
    • First-order Markov property: The state $x^{(t)}$ depends only on $x^{(t-1)}$.
    • $n$-th order Markov property: The state $x^{(t)}$ depends only on $x^{(t-1)}, \dots, x^{(t-n)}$.
  • Transition probability $p(x^{(t)}|x^{(t-1)})$: The probability of transitioning from state $x^{(t-1)}$ at time $t-1$ to state $x^{(t)}$ at time $t$.
  • Stationary distribution $\pi(x)$: The probability distribution over states that a Markov chain converges to after running for a sufficiently long time (under suitable conditions, e.g. irreducibility and aperiodicity). The stationary distribution satisfies the following balance equation; a small numerical check follows this list: $$ \pi(y) = \sum_x p(y|x)\pi(x) $$
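
To make the balance equation concrete, here is a minimal numerical sketch (the two-state transition matrix below is an arbitrary illustrative choice, not from the reference). Repeatedly applying the transition kernel to any initial distribution drives it toward $\pi$, which then satisfies $\pi(y) = \sum_x p(y|x)\pi(x)$:

```python
import numpy as np

# Hypothetical two-state chain; T[x, y] = p(y | x), each row sums to 1.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# "Run the chain for a long time": repeatedly apply the kernel to an
# arbitrary initial distribution; it converges to the stationary distribution.
pi = np.array([1.0, 0.0])   # start with all probability mass on state 0
for _ in range(100):
    pi = pi @ T

print(pi)                        # approx. [0.8333, 0.1667]
print(np.allclose(pi @ T, pi))   # balance equation pi(y) = sum_x p(y|x) pi(x) -> True
```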

Markov Chain Monte Carlo (MCMC)

MCMC is a method for generating samples from complex probability distributions that are difficult to sample from directly. It works by designing transition probabilities so that the stationary distribution of a Markov chain matches the target distribution, and then simulating that Markov chain to generate samples.
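
As a toy illustration of this design principle (a sketch; the three-state target distribution and the uniform proposal are arbitrary assumptions), the following builds a transition kernel whose stationary distribution is exactly a prescribed target. The construction used here, accepting or rejecting proposed moves with probability $\min(1, p(y)/p(x))$, is a special case of the Metropolis-Hastings algorithm described below:

```python
import numpy as np

# Hypothetical 3-state target distribution (any strictly positive probabilities work).
p = np.array([0.2, 0.3, 0.5])
n = len(p)
q = 1.0 / n   # symmetric proposal: pick one of the n states uniformly at random

# Transition kernel: T[x, y] = q(y|x) * min(1, p(y)/p(x)) for y != x,
# with all rejected probability mass placed back on the diagonal.
T = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            T[x, y] = q * min(1.0, p[y] / p[x])
    T[x, x] = 1.0 - T[x].sum()

# The balance equation pi(y) = sum_x p(y|x) pi(x) holds with pi = p:
print(p @ T)                    # reproduces [0.2, 0.3, 0.5]
print(np.allclose(p @ T, p))    # True
```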

Among the sequence of samples $x^{(1)}, x^{(2)}, \dots$ generated by MCMC, the initial samples are strongly influenced by the starting value of the Markov chain, so they are discarded as the burn-in period. Only samples obtained after a sufficiently long burn-in ($x^{(t)}, x^{(t+1)}, \dots$ for sufficiently large $t$) are used for inference.

One of the most widely used MCMC algorithms is the Metropolis-Hastings algorithm.

Metropolis-Hastings Algorithm

Suppose we want to obtain samples from a target distribution $p(x)$.

  1. Set an initial state $x^{(1)}$ randomly.
  2. Repeat the following steps for $t=1, 2, \dots$:
     a. From the current state $x^{(t)}$, sample a candidate state $\hat{y}$ from the proposal distribution $q(\hat{y}|x^{(t)})$.
     b. Compute the acceptance probability $\alpha(x^{(t)}, \hat{y})$, where $p(x)$ is the target distribution: $$ \alpha(x^{(t)}, \hat{y}) = \min\left(1, \frac{p(\hat{y})q(x^{(t)}|\hat{y})}{p(x^{(t)})q(\hat{y}|x^{(t)})}\right) $$
     c. Generate a uniform random number $u$ from $[0, 1]$. If $u < \alpha(x^{(t)}, \hat{y})$, accept the candidate and set $x^{(t+1)} = \hat{y}$; otherwise, set $x^{(t+1)} = x^{(t)}$ and remain in the current state.

By repeating this process, the distribution of the generated samples converges to the target distribution $p(x)$.
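
Below is a minimal Python sketch of random-walk Metropolis-Hastings (the bimodal unnormalized target, the Gaussian step size, and the burn-in length are illustrative assumptions, not prescribed by the algorithm). Because the Gaussian random-walk proposal is symmetric, the $q$ terms in the acceptance ratio cancel; note also that $p(x)$ only needs to be known up to a normalizing constant, since that constant cancels in the ratio as well.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized target density (hypothetical two-component Gaussian mixture).
    Metropolis-Hastings only needs the target up to a normalizing constant."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    """Random-walk Metropolis-Hastings with proposal q(y|x) = Normal(y; x, step^2).
    The proposal is symmetric, so q(x|y)/q(y|x) = 1 and the acceptance
    probability reduces to min(1, p(y)/p(x))."""
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + step * rng.standard_normal()       # a. propose a candidate state
        alpha = min(1.0, p_tilde(y) / p_tilde(x))  # b. acceptance probability
        if rng.uniform() < alpha:                  # c. accept with probability alpha,
            x = y                                  #    otherwise keep the current state
        chain[t] = x
    return chain

chain = metropolis_hastings(50_000, x0=10.0)  # deliberately poor starting value
samples = chain[5_000:]                       # discard the burn-in period
print(samples.mean(), samples.std())          # Monte Carlo estimates from the retained samples
```

The step size of the proposal controls a trade-off: very small steps are almost always accepted but the chain explores the target slowly, while very large steps are rejected most of the time; in practice it is tuned to balance acceptance rate against mixing speed.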

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)