The Binomial Distribution and Related Distributions

An overview of the binomial distribution family: Bernoulli, binomial, categorical, and multinomial distributions with their probability mass functions.

Bernoulli Distribution

The Bernoulli distribution is a discrete probability distribution that models a trial with only two possible outcomes (e.g., heads or tails in a coin toss, success or failure). Typically, success is represented as 1 and failure as 0.

  • If the probability of success is \(\mu\), then the probability of failure is \(1 - \mu\).
  • \(\mu\) takes values in the range \(0 \le \mu \le 1\).

The probability mass function is:

\[ p(x|\mu) = \mu^x (1 - \mu)^{1-x} \]

where \(x\) takes values of 0 or 1.
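As a quick sketch, the Bernoulli PMF above can be computed directly (the function name `bernoulli_pmf` is my own, not from the reference):

```python
def bernoulli_pmf(x: int, mu: float) -> float:
    """p(x | mu) = mu^x * (1 - mu)^(1 - x), for x in {0, 1}."""
    assert x in (0, 1), "x must be 0 (failure) or 1 (success)"
    assert 0.0 <= mu <= 1.0, "mu must lie in [0, 1]"
    return mu**x * (1.0 - mu) ** (1 - x)

# A fair coin: success and failure are equally likely.
print(bernoulli_pmf(1, 0.5))  # 0.5
```

Note how the exponents select the right factor: \(x = 1\) leaves only \(\mu\), and \(x = 0\) leaves only \(1 - \mu\).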

Binomial Distribution

The binomial distribution is a discrete probability distribution that represents the probability of obtaining \(r\) successes when independently repeating a Bernoulli trial \(m\) times.

The probability mass function is:

\[ p(r|m, \mu) = \binom{m}{r} \mu^r (1 - \mu)^{m-r} \]

where \(\binom{m}{r} = \frac{m!}{r!(m-r)!}\) is the binomial coefficient.
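A minimal implementation of this PMF, using Python's built-in `math.comb` for the binomial coefficient (the function name is my own):

```python
from math import comb

def binomial_pmf(r: int, m: int, mu: float) -> float:
    """p(r | m, mu) = C(m, r) * mu^r * (1 - mu)^(m - r)."""
    return comb(m, r) * mu**r * (1.0 - mu) ** (m - r)

# Probability of exactly 2 heads in 4 fair-coin tosses:
# C(4, 2) / 2^4 = 6 / 16 = 0.375
print(binomial_pmf(2, 4, 0.5))  # 0.375
```

Summing the PMF over \(r = 0, \dots, m\) gives 1, which is a handy sanity check for any implementation.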

Categorical Distribution / Multinoulli Distribution

This is a generalization of the Bernoulli distribution to a single trial whose outcome falls into one of \(k\) categories. For example, it can model the outcome of rolling a die once (\(k = 6\)).

  • The probability of each category \(j\) occurring is denoted \(\mu_j\).
  • The constraint \(\sum_{j=1}^k \mu_j = 1\) must be satisfied.

The probability mass function is:

\[ p(x|\mu) = \prod_{j=1}^k \mu_j^{x_j} \]

where \(x\) is a one-hot vector (e.g., if category \(j\) occurs, only its element \(x_j\) is 1 and all others are 0).

This distribution is specifically called the categorical distribution when representing the result of a single trial.
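A sketch of the categorical PMF with the one-hot encoding described above (function name and list-based representation are my own choices):

```python
def categorical_pmf(x: list[int], mu: list[float]) -> float:
    """p(x | mu) = prod_j mu_j^{x_j}, where x is a one-hot vector."""
    assert sum(x) == 1 and all(xj in (0, 1) for xj in x), "x must be one-hot"
    assert abs(sum(mu) - 1.0) < 1e-9, "category probabilities must sum to 1"
    prob = 1.0
    for xj, mj in zip(x, mu):
        prob *= mj**xj  # only the observed category contributes
    return prob

# Rolling a fair six-sided die and observing face 3 (one-hot at index 2):
mu = [1 / 6] * 6
print(categorical_pmf([0, 0, 1, 0, 0, 0], mu))
```

Because \(x\) is one-hot, the product collapses to the single factor \(\mu_j\) of the observed category; the exponent notation is just a compact way of writing that selection.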

Multinomial Distribution

This is a generalization of the binomial distribution to the case where a categorical trial is independently repeated \(m\) times, representing how many times each category appears.

The probability mass function is:

\[ p(x_1, \dots, x_k | m, \mu_1, \dots, \mu_k) = \frac{m!}{x_1! x_2! \dots x_k!} \mu_1^{x_1} \mu_2^{x_2} \dots \mu_k^{x_k} \]

where each \(x_j\) is the number of times category \(j\) occurred and \(m = \sum_{j=1}^k x_j\) is the total number of trials.
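The multinomial PMF can be sketched by accumulating the coefficient and the probability product in one pass (function name is my own; each partial quotient of factorials is an exact integer, so integer division is safe):

```python
from math import factorial

def multinomial_pmf(x: list[int], mu: list[float]) -> float:
    """p(x_1..x_k | m, mu_1..mu_k) = m!/(x_1!...x_k!) * prod_j mu_j^{x_j}."""
    m = sum(x)  # total number of trials
    coeff = factorial(m)
    prob = 1.0
    for xj, mj in zip(x, mu):
        coeff //= factorial(xj)  # build the multinomial coefficient
        prob *= mj**xj
    return coeff * prob

# Rolling a fair die 6 times and seeing each face exactly once:
# 6! * (1/6)^6 = 720 / 46656 ≈ 0.0154
print(multinomial_pmf([1, 1, 1, 1, 1, 1], [1 / 6] * 6))
```

With \(k = 2\) this reduces to the binomial distribution, which makes a convenient consistency check.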

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)