Introduction
The LMS (Least Mean Squares) algorithm iteratively approximates the Wiener-filter optimum using stochastic gradient descent. Its computational cost is just \(O(L)\) per sample, it is numerically stable, and it underpins echo cancellation, noise cancellation, channel equalization, and system identification in production systems.
The Wiener filter requires the input statistics, the autocorrelation matrix \(\mathbf{R}\) and the cross-correlation vector \(\mathbf{p}\), to be known in advance. LMS removes that assumption: it updates the filter coefficients directly from the observed samples, which makes it far more practical for real-time use.
Problem Formulation
Consider a length-\(L\) FIR adaptive filter. At time \(n\) define the input vector and coefficient vector
\[ \mathbf{x}(n) = [x(n), x(n-1), \dots, x(n-L+1)]^\top \tag{1} \] \[ \mathbf{w}(n) = [w_0(n), w_1(n), \dots, w_{L-1}(n)]^\top \tag{2} \]The filter output and error are
\[ y(n) = \mathbf{w}(n)^\top \mathbf{x}(n), \qquad e(n) = d(n) - y(n) \tag{3} \]where \(d(n)\) is the desired (reference) signal. We minimize the expected squared error
\[ J(\mathbf{w}) = E\bigl[e(n)^2\bigr] \tag{4} \]iteratively.
LMS Algorithm
Stochastic Gradient Approximation
The exact gradient of \(J(\mathbf{w})\) is
\[ \nabla J = -2\, E\bigl[e(n)\, \mathbf{x}(n)\bigr] \tag{5} \]but the expectation is unknown in practice. The key idea of LMS is to replace the expectation with the instantaneous value:
\[ \hat{\nabla} J(n) = -2\, e(n)\, \mathbf{x}(n) \tag{6} \]Substituting into steepest descent yields the LMS update rule:
\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu\, e(n)\, \mathbf{x}(n) \tag{7} \]Here \(\mu > 0\) is the step size (learning rate); the factor of 2 is absorbed into \(\mu\) .
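As a minimal sketch of one iteration (the numbers below are arbitrary toy values; a full implementation follows in the Python section), Eqs. (3) and (7) translate directly into a few lines of NumPy:
import numpy as np

# One LMS iteration, Eqs. (3) and (7); toy values for illustration only.
L, mu = 4, 0.1
x = np.array([0.9, -0.2, 0.4, 1.1, 0.3])   # input samples x(0)..x(4)
d = np.array([0.5, 0.1, -0.3, 0.8, 0.2])   # desired signal d(0)..d(4)
w = np.zeros(L)                             # initial coefficients w(0)

n = 4
x_vec = x[n:n - L:-1]          # x(n), x(n-1), ..., x(n-L+1), Eq. (1)
e_n = d[n] - w @ x_vec         # error, Eq. (3)
w = w + mu * e_n * x_vec       # update, Eq. (7)
print(e_n, w)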
Convergence Conditions
For mean convergence (\(E[\mathbf{w}(n)] \to \mathbf{w}^\ast\) ), the step size must satisfy
\[ 0 < \mu < \frac{2}{\lambda_{\max}} \tag{8} \]where \(\lambda_{\max}\) is the largest eigenvalue of the input autocorrelation matrix \(\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}(n)^\top]\) . A more conservative practical bound uses the input power:
\[ 0 < \mu < \frac{2}{L \cdot P_x}, \qquad P_x = E\bigl[x(n)^2\bigr] \tag{9} \]which follows from \(\lambda_{\max} \le \mathrm{tr}(\mathbf{R}) = L P_x\) .
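For example, the Python experiment below uses a length-\(L = 16\) filter driven by unit-variance white noise, so \(P_x \approx 1\) and bound (9) gives \(0 < \mu < 2/16 = 0.125\); the step \(\mu = 0.01\) used there sits comfortably inside that range.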
The qualitative behavior:
| Step size \(\mu\) | Behavior |
|---|---|
| Small | Slow convergence, low steady-state error |
| Moderate | Balanced trade-off |
| Large | Fast convergence, high steady-state error |
| Above bound | Diverges (error grows unboundedly) |
The eigenvalue spread \(\chi(\mathbf{R}) = \lambda_{\max} / \lambda_{\min}\) governs convergence speed: highly colored inputs converge slowly. See Autocorrelation Function for details.
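A quick numerical check of this effect, with \(\mathbf{R}\) estimated from delay-embedded samples (the moving-average coloring below is an arbitrary choice, purely for illustration):
import numpy as np

# Eigenvalue spread chi(R) for white vs. colored input (illustrative sketch).
np.random.seed(0)
L = 16
white = np.random.randn(20000)
colored = np.convolve(white, np.ones(4) / 4, mode='same')  # arbitrary low-pass coloring

def eig_spread(x, L):
    X = np.array([x[n:n - L:-1] for n in range(L, len(x))])  # rows are x(n) vectors, Eq. (1)
    R = (X.T @ X) / X.shape[0]                                # sample estimate of R
    lam = np.linalg.eigvalsh(R)                               # eigenvalues, ascending
    return lam[-1] / lam[0]

print(f"white input:   chi(R) ≈ {eig_spread(white, L):.1f}")
print(f"colored input: chi(R) ≈ {eig_spread(colored, L):.1f}")
The much larger spread of the colored input is exactly what slows LMS down.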
NLMS Algorithm
Why Normalize
LMS convergence couples \(\mu\) with input power \(P_x\) . With non-stationary inputs (e.g., speech), a fixed \(\mu\) is simultaneously too small in quiet periods and too large in loud ones, which compromises stability.
NLMS (Normalized LMS) normalizes the step by the squared Euclidean norm of the input vector:
\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\tilde{\mu}}{\lVert \mathbf{x}(n) \rVert^2 + \varepsilon}\, e(n)\, \mathbf{x}(n) \tag{10} \]The normalized step \(\tilde{\mu} \in (0, 2)\) no longer depends on input power. The small constant \(\varepsilon\) guards against division by tiny norms.
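One way to see what the normalization buys is to compare the effective step \(\tilde{\mu} / (\lVert \mathbf{x}(n) \rVert^2 + \varepsilon)\) when the same input segment appears at two different levels; a small sketch with arbitrary values:
import numpy as np

# Effective NLMS step under an input-level change (arbitrary illustrative values).
np.random.seed(0)
L, mu_tilde, eps = 16, 0.5, 1e-6
x_quiet = 0.1 * np.random.randn(L)   # low-power input segment
x_loud = 10.0 * x_quiet              # same segment, 20 dB louder

for name, x_vec in [("quiet", x_quiet), ("loud", x_loud)]:
    step = mu_tilde / (x_vec @ x_vec + eps)
    print(f"{name}: effective step = {step:.4g}, step * ||x||^2 = {step * (x_vec @ x_vec):.3f}")
The effective step adapts to the input energy so that the product of step and \(\lVert \mathbf{x}(n) \rVert^2\) stays at \(\tilde{\mu}\); plain LMS with a fixed \(\mu\) has no such compensation.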
LMS vs NLMS
| Feature | LMS | NLMS |
|---|---|---|
| Cost | \(O(L)\) | \(O(L)\) |
| Step-size bound | \(\mu < 2/(L P_x)\) (depends on input power) | \(\tilde{\mu} < 2\) (independent of input power) |
| Robustness to input-level changes | Weak | Strong |
| Implementation effort | Minimal | LMS plus a normalization term |
Python Implementation
System Identification Example
We identify an unknown FIR system \(\mathbf{w}^\ast\) driven by white noise, using both LMS and NLMS, and plot the learning curves (squared error over iterations).
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
# --- Unknown system to identify ---
L = 16
w_true = np.array([0.5, -0.3, 0.2, 0.1, -0.05, 0.02,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0])
# --- Input and observation ---
N = 5000
x = np.random.randn(N)
v = 0.05 * np.random.randn(N) # measurement noise
d = np.convolve(x, w_true, mode='full')[:N] + v
def lms(x, d, L, mu):
    """Standard LMS: w(n+1) = w(n) + mu * e(n) * x(n), Eq. (7)."""
    N = len(x)
    w = np.zeros(L)
    e = np.zeros(N)
    for n in range(L, N):
        x_vec = x[n:n - L:-1]          # x(n), x(n-1), ..., x(n-L+1), newest first
        y = w @ x_vec                  # filter output, Eq. (3)
        e[n] = d[n] - y                # a-priori error
        w = w + mu * e[n] * x_vec      # coefficient update, Eq. (7)
    return w, e
def nlms(x, d, L, mu_tilde, eps=1e-6):
    """Normalized LMS: step scaled by 1 / (||x(n)||^2 + eps), Eq. (10)."""
    N = len(x)
    w = np.zeros(L)
    e = np.zeros(N)
    for n in range(L, N):
        x_vec = x[n:n - L:-1]          # x(n), x(n-1), ..., x(n-L+1), newest first
        y = w @ x_vec                  # filter output
        e[n] = d[n] - y                # a-priori error
        norm = x_vec @ x_vec + eps     # input energy (eps guards against tiny norms)
        w = w + (mu_tilde / norm) * e[n] * x_vec   # normalized update, Eq. (10)
    return w, e
# --- Run ---
mu_lms = 0.01
mu_nlms = 0.5
w_lms, e_lms = lms(x, d, L, mu_lms)
w_nlms, e_nlms = nlms(x, d, L, mu_nlms)
# --- Learning curve (smoothed squared error) ---
def smooth(v, win=50):
    """Moving-average smoothing of the squared error over a win-sample window."""
    kernel = np.ones(win) / win
    return np.convolve(v ** 2, kernel, mode='same')
plt.figure(figsize=(10, 5))
plt.semilogy(smooth(e_lms), label=f'LMS (μ={mu_lms})')
plt.semilogy(smooth(e_nlms), label=f'NLMS (μ̃={mu_nlms})')
plt.xlabel('Iteration n')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curves: LMS vs NLMS')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"LMS final coeff error: {np.linalg.norm(w_lms - w_true):.4f}")
print(f"NLMS final coeff error: {np.linalg.norm(w_nlms - w_true):.4f}")
The log-scale curves show an exponential decay phase followed by a flat steady-state misadjustment floor. Increasing \(\mu\) (or \(\tilde{\mu}\) ) speeds up the initial decay but raises the floor — the canonical convergence-vs-misadjustment trade-off.
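The same floor can be read off numerically; a short check, run after the script above (the 1000-sample tail is an arbitrary choice), compares each algorithm's steady-state MSE with the measurement-noise floor set by \(v(n)\):
# Empirical steady-state MSE over the last 1000 samples (run after the script above).
tail = 1000
print(f"LMS  steady-state MSE ≈ {np.mean(e_lms[-tail:] ** 2):.4e}")
print(f"NLMS steady-state MSE ≈ {np.mean(e_nlms[-tail:] ** 2):.4e}")
print(f"noise floor E[v^2]    ≈ {np.mean(v ** 2):.4e}")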
Step-Size Sensitivity
plt.figure(figsize=(10, 5))
for mu in [0.005, 0.02, 0.05, 0.1]:
    _, e = lms(x, d, L, mu)
    plt.semilogy(smooth(e), label=f'μ={mu}', alpha=0.8)
plt.xlabel('Iteration n')
plt.ylabel('Mean Squared Error')
plt.title('LMS: Sensitivity to Step Size μ')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Tiny \(\mu\) converges painfully slowly; values near the upper bound oscillate from the start. As a rule of thumb, choose at most half of the bound in (8), or of its practical surrogate (9).
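Continuing in the same session (reusing x and L from the script above), the rule of thumb can be made concrete with the practical bound (9); the factor 0.5 below is just the rule of thumb, not a hard limit:
# Rule-of-thumb step size: half of the practical bound of Eq. (9), for the input above.
P_x = np.mean(x ** 2)                      # input power estimate
mu_bound = 2.0 / (L * P_x)                 # practical bound, Eq. (9)
print(f"practical bound ≈ {mu_bound:.4f}, rule-of-thumb μ ≈ {0.5 * mu_bound:.4f}")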
Connection to the Wiener Filter
For step sizes satisfying (8), LMS converges in mean to the Wiener solution \(\mathbf{w}^\ast = \mathbf{R}^{-1}\mathbf{p}\), with \(\mathbf{p} = E[d(n)\,\mathbf{x}(n)]\):
\[ E\bigl[\mathbf{w}(n)\bigr] \xrightarrow{n \to \infty} \mathbf{w}^\ast \tag{11} \]Because LMS uses an instantaneous gradient, a finite excess MSE remains. RLS (Recursive Least Squares) accelerates convergence at higher cost.
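As a quick cross-check against the system-identification script above (reusing x, d, L, w_true, and w_lms from that session), the Wiener solution can be estimated from sample statistics and compared with the adaptive result; a minimal sketch:
# Sample-based Wiener solution w* = R^{-1} p (run after the script above).
X = np.array([x[n:n - L:-1] for n in range(L, len(x))])  # rows are x(n) vectors, Eq. (1)
R_hat = (X.T @ X) / X.shape[0]                            # sample estimate of R
p_hat = (X.T @ d[L:]) / X.shape[0]                        # sample estimate of p = E[d(n) x(n)]
w_wiener = np.linalg.solve(R_hat, p_hat)
print(f"Wiener vs true:   {np.linalg.norm(w_wiener - w_true):.4f}")
print(f"LMS    vs Wiener: {np.linalg.norm(w_lms - w_wiener):.4f}")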
When to Use What
| Use case | Choice | Reason |
|---|---|---|
| Stationary input level | LMS | Smallest implementation, lowest cost |
| Speech / time-varying signals | NLMS | Robust to input-power swings |
| Fast convergence is critical | RLS | Quadratic cost \(O(L^2)\) but quick |
| Embedded / low-resource targets | LMS / NLMS | \(O(L)\) is sufficient |
Related Articles
- Wiener Filter: Optimal Linear Filtering Theory and Python Implementation - The optimum that LMS approximates iteratively; compares batch (frequency-domain) and adaptive (time-domain) approaches.
- Adaptive Filters (LMS/RLS): Theory and Python Implementation - Broader treatment including RLS and noise-cancellation applications.
- Autocorrelation Function: Theory and Python Implementation - The input autocorrelation matrix \(\mathbf{R}\) governs LMS convergence speed.
- Fast Fourier Transform (FFT): Theory and Python Implementation - Foundation for frequency-domain adaptive filters (FDAF).
- Kalman Filter: Theory and Python Implementation - A different recursive estimator; RLS is mathematically equivalent to a Kalman filter.
- Moving Average Filters Compared: SMA, WMA, and EMA in Python - Contrast with fixed-coefficient filters.
References
- Haykin, S. (2014). Adaptive Filter Theory (5th ed.). Pearson. Chapters 5-6.
- Widrow, B., & Stearns, S. D. (1985). Adaptive Signal Processing. Prentice Hall.
- Sayed, A. H. (2008). Adaptive Filters. Wiley-IEEE Press.