Maximum Likelihood Estimation (MLE) is one of the most common methods for estimating the parameters of statistical models. It was systematized by the statistician Ronald Fisher in the early 20th century.
The Idea Behind Maximum Likelihood Estimation
MLE estimates the parameters that maximize the likelihood function. The likelihood function can be interpreted as “the probability (or probability density) that the given data was generated by a model with specific parameters.”
Suppose we have observed data $x = \{x_1, x_2, \dots, x_n\}$ that follows a probability distribution $p(x|\theta)$ with parameter $\theta$. If the observations are independent and identically distributed (i.i.d.), the likelihood function $L(\theta|x)$ is defined as the product of the probabilities (densities) of the individual observations.
$$ L(\theta|x) = p(x|\theta) = \prod_{i=1}^n p(x_i|\theta) $$
MLE finds the parameter $\hat{\theta}_{ML}$ that maximizes this likelihood function $L(\theta|x)$.
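As a minimal sketch of this idea, the following Python snippet evaluates the likelihood of a small i.i.d. sample under a normal model for two candidate parameter settings. The data values and the candidate parameters are made-up illustrations, not part of the derivation.

```python
import numpy as np
from scipy.stats import norm

# Illustrative i.i.d. sample (made-up numbers).
x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])

def likelihood(x, mu, sigma):
    """L(theta|x) = prod_i p(x_i | theta) for a normal model."""
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

# The likelihood ranks candidate parameters by how well they explain the data.
print(likelihood(x, mu=2.0, sigma=0.5))  # relatively large
print(likelihood(x, mu=5.0, sigma=0.5))  # much smaller
```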
Log-Likelihood Function
Since the likelihood function is a product, computations can become complex, and differentiating products is cumbersome. Therefore, it is common to maximize the log-likelihood function $\log L(\theta|x)$ instead, obtained by taking the logarithm (a monotonically increasing function). The logarithm converts products into sums, making differentiation easier.
$$ \log L(\theta|x) = \sum_{i=1}^n \log p(x_i|\theta) $$
Since the logarithm is monotonically increasing, maximizing the likelihood function is equivalent to maximizing the log-likelihood function.
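A quick numerical sketch of this equivalence, again with made-up data and a fixed $\sigma$: a grid search over $\mu$ picks the same maximizer whether we score candidates with the likelihood (a product) or the log-likelihood (a sum).

```python
import numpy as np
from scipy.stats import norm

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])  # same illustrative sample
sigma = 0.5                               # assumed known for this check
mus = np.linspace(0.0, 4.0, 401)          # grid of candidate means

# Likelihood (product of densities) and log-likelihood (sum of log-densities).
lik = np.array([np.prod(norm.pdf(x, loc=m, scale=sigma)) for m in mus])
loglik = np.array([np.sum(norm.logpdf(x, loc=m, scale=sigma)) for m in mus])

# Both criteria are maximized at the same mu (up to the grid resolution).
print(mus[np.argmax(lik)], mus[np.argmax(loglik)])
```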
Maximum Likelihood Estimation for the Normal Distribution
Here, we assume that observed data $x = \{x_1, x_2, \dots, x_n\}$ follows a normal distribution $\mathcal{N}(x | \mu, \sigma^2)$, and derive the MLE for its parameters: the mean $\mu$ and variance $\sigma^2$.
The probability density function of the normal distribution is: $$ p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$
Substituting this density into the log-likelihood and simplifying (the log of the product of $n$ Gaussian densities becomes a sum), we obtain: $$ \log L(\mu, \sigma^2 | x) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{n}{2} \log(2\pi\sigma^2) $$
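To double-check this closed form, the sketch below compares it against a direct evaluation of the Gaussian log-density via SciPy; the data and the test values of $\mu$ and $\sigma^2$ are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])
mu, sigma2 = 2.0, 0.25  # arbitrary test values

# Closed-form expression from the text.
closed_form = (-np.sum((x - mu) ** 2) / (2 * sigma2)
               - len(x) / 2 * np.log(2 * np.pi * sigma2))

# Direct evaluation via the library's log-density.
direct = np.sum(norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

print(np.isclose(closed_form, direct))  # True
```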
MLE of the Mean $\mu$
We take the partial derivative of the log-likelihood function with respect to $\mu$ and set it to zero.
$$ \frac{\partial \log L}{\partial \mu} = -\frac{1}{2\sigma^2} \sum_{i=1}^n 2(x_i - \mu)(-1) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) $$
Setting this to zero: $$ \sum_{i=1}^n (x_i - \mu) = 0 \implies \sum_{i=1}^n x_i - n\mu = 0 \implies \hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^n x_i $$
Therefore, the MLE of the mean of a normal distribution equals the sample mean. This is one reason why the arithmetic mean is so widely used in statistics.
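As a numerical sanity check of this result (not part of the derivation itself), the sketch below minimizes the negative log-likelihood over $\mu$ with $\sigma$ treated as known; the data and the value of $\sigma$ are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])
sigma = 0.5  # treated as known so the optimization is over mu alone

def neg_loglik(mu):
    """Negative log-likelihood as a function of mu."""
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize_scalar(neg_loglik)
print(result.x)    # numerical maximizer of the likelihood
print(np.mean(x))  # sample mean: agrees up to solver tolerance
```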
MLE of the Variance $\sigma^2$
We take the partial derivative of the log-likelihood function with respect to $\sigma^2$ and set it to zero.
$$ \frac{\partial \log L}{\partial \sigma^2} = -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 \left(-\frac{1}{(\sigma^2)^2}\right) - \frac{n}{2} \frac{1}{\sigma^2} $$ $$ = \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{n}{2\sigma^2} $$
Setting this to zero: $$ \frac{1}{(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2 = \frac{n}{\sigma^2} \implies \hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 $$
Here, we substitute the previously derived MLE $\hat{\mu}_{ML}$ for $\mu$.
$$ \hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}_{ML})^2 $$
This coincides with the sample variance that divides by $n$. Note, however, that it differs from the unbiased variance, which divides by $n-1$: the MLE systematically underestimates the true variance, since its expected value is $\frac{n-1}{n}\sigma^2$.
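The bias can be seen in a quick Monte Carlo check. The sketch below (with made-up true parameters and a small sample size) compares the average of the MLE variance (`ddof=0` in NumPy) with the average of the unbiased estimator (`ddof=1`).

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma2, n = 0.0, 1.0, 5
trials = 100_000

# Draw many small samples and compute both variance estimators for each.
samples = rng.normal(true_mu, np.sqrt(true_sigma2), size=(trials, n))
mle_var = samples.var(axis=1, ddof=0)       # divide by n (the MLE)
unbiased_var = samples.var(axis=1, ddof=1)  # divide by n-1

print(mle_var.mean())       # approximately (n-1)/n * sigma^2 = 0.8
print(unbiased_var.mean())  # approximately sigma^2 = 1.0
```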
References
- Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)