Adaptive Filters: LMS and NLMS Algorithms — Theory and Python Implementation

LMS (Least Mean Squares) and NLMS (Normalized LMS) algorithms explained: derivation via stochastic gradient descent, step-size convergence conditions, and Python implementations for system identification with learning-curve plots.

Introduction

The LMS (Least Mean Squares) algorithm iteratively approximates the Wiener-filter optimum using stochastic gradient descent. It costs only \(O(L)\) operations per sample, is numerically stable, and underpins echo cancellation, noise cancellation, channel equalization, and system identification in production systems.

The Wiener filter requires the input statistics to be known: the autocorrelation matrix \(\mathbf{R}\) of the input and the cross-correlation vector \(\mathbf{p}\) with the desired signal. LMS removes that assumption: it updates the filter coefficients directly from observed samples, which is far more practical.

Problem Formulation

Consider a length-\(L\) FIR adaptive filter. At time \(n\) define the input vector and coefficient vector

\[ \mathbf{x}(n) = [x(n), x(n-1), \dots, x(n-L+1)]^\top \tag{1} \] \[ \mathbf{w}(n) = [w_0(n), w_1(n), \dots, w_{L-1}(n)]^\top \tag{2} \]

The filter output and error are

\[ y(n) = \mathbf{w}(n)^\top \mathbf{x}(n), \qquad e(n) = d(n) - y(n) \tag{3} \]

where \(d(n)\) is the desired (reference) signal. We minimize the expected squared error

\[ J(\mathbf{w}) = E\bigl[e(n)^2\bigr] \tag{4} \]

iteratively.
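
Expanding (4) with \(\mathbf{p} = E[d(n)\,\mathbf{x}(n)]\) and \(\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}(n)^\top]\) (both reappear below) shows that the cost surface is quadratic in \(\mathbf{w}\):

\[ J(\mathbf{w}) = E\bigl[d(n)^2\bigr] - 2\,\mathbf{w}^\top \mathbf{p} + \mathbf{w}^\top \mathbf{R}\,\mathbf{w} \]

For positive-definite \(\mathbf{R}\) this bowl has a unique minimum, the Wiener solution \(\mathbf{w}^\ast = \mathbf{R}^{-1}\mathbf{p}\) revisited at the end of this article.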

LMS Algorithm

Stochastic Gradient Approximation

The exact gradient of \(J(\mathbf{w})\) is

\[ \nabla J = -2\, E\bigl[e(n)\, \mathbf{x}(n)\bigr] \tag{5} \]

but the expectation is unknown in practice. The key idea of LMS is to replace the expectation with the instantaneous value:

\[ \hat{\nabla} J(n) = -2\, e(n)\, \mathbf{x}(n) \tag{6} \]

Substituting into steepest descent yields the LMS update rule:

\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu\, e(n)\, \mathbf{x}(n) \tag{7} \]

Here \(\mu > 0\) is the step size (learning rate); the factor of 2 is absorbed into \(\mu\).
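
A minimal sketch of a single iteration (the data, index, and \(\mu\) value below are placeholders; the full implementation follows later):

import numpy as np

# One LMS iteration on placeholder data (illustrative values only)
L, n, mu = 4, 10, 0.01
x = np.random.randn(20)            # input signal
d = np.random.randn(20)            # desired signal
w = np.zeros(L)                    # current coefficients

x_vec = x[n:n - L:-1]              # [x(n), x(n-1), ..., x(n-L+1)], Eq. (1)
e = d[n] - w @ x_vec               # a priori error, Eq. (3)
w = w + mu * e * x_vec             # update, Eq. (7)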

Convergence Conditions

For mean convergence (\(E[\mathbf{w}(n)] \to \mathbf{w}^\ast\)), the step size must satisfy

\[ 0 < \mu < \frac{2}{\lambda_{\max}} \tag{8} \]

where \(\lambda_{\max}\) is the largest eigenvalue of the input autocorrelation matrix \(\mathbf{R} = E[\mathbf{x}(n)\mathbf{x}(n)^\top]\). A more conservative practical bound uses the input power:

\[ 0 < \mu < \frac{2}{L \cdot P_x}, \qquad P_x = E\bigl[x(n)^2\bigr] \tag{9} \]

which follows from \(\lambda_{\max} \le \mathrm{tr}(\mathbf{R}) = L P_x\).
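
Both bounds are easy to estimate from data. A quick sketch (the white-noise input here is just a stand-in for whatever signal drives the filter):

import numpy as np

L = 16
x = np.random.randn(50_000)            # placeholder input signal
P_x = np.mean(x ** 2)                  # input power estimate

mu_bound_power = 2.0 / (L * P_x)       # conservative bound (9)

# Tighter bound (8) from an estimated autocorrelation matrix R
r = np.array([np.mean(x[k:] * x[:len(x) - k]) for k in range(L)])
R = np.array([[r[abs(i - j)] for j in range(L)] for i in range(L)])
mu_bound_eig = 2.0 / np.linalg.eigvalsh(R).max()

print(f"bound (9): {mu_bound_power:.3f}   bound (8): {mu_bound_eig:.3f}")

For white noise the two differ by roughly a factor of \(L\), which is exactly how conservative (9) is.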

The qualitative behavior:

| Step size \(\mu\) | Behavior |
| --- | --- |
| Small | Slow convergence, low steady-state error |
| Moderate | Balanced trade-off |
| Large | Fast convergence, high steady-state error |
| Above the bound | Diverges (error grows unboundedly) |

The eigenvalue spread \(\chi(\mathbf{R}) = \lambda_{\max} / \lambda_{\min}\) governs convergence speed: highly colored inputs converge slowly. See Autocorrelation Function for details.
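
To see this numerically, compare the estimated spread for white noise against an AR(1)-colored input (a sketch; the pole at 0.9 is an arbitrary choice):

import numpy as np
from scipy.signal import lfilter

def eig_spread(x, L=16):
    """Estimate chi(R) = lambda_max / lambda_min from sample autocorrelations."""
    r = np.array([np.mean(x[k:] * x[:len(x) - k]) for k in range(L)])
    R = np.array([[r[abs(i - j)] for j in range(L)] for i in range(L)])
    lam = np.linalg.eigvalsh(R)
    return lam.max() / lam.min()

N = 100_000
white = np.random.randn(N)
colored = lfilter([1.0], [1.0, -0.9], np.random.randn(N))   # AR(1) process

print(f"white:   chi = {eig_spread(white):.1f}")    # near 1: LMS converges fast
print(f"colored: chi = {eig_spread(colored):.1f}")  # large: LMS converges slowly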

NLMS Algorithm

Why Normalize

LMS convergence couples \(\mu\) with input power \(P_x\). With non-stationary inputs (e.g., speech), a fixed \(\mu\) is simultaneously too small in quiet periods and too large in loud ones, which compromises stability.

NLMS (Normalized LMS) normalizes the step by the input vector norm:

\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\tilde{\mu}}{\lVert \mathbf{x}(n) \rVert^2 + \varepsilon}\, e(n)\, \mathbf{x}(n) \tag{10} \]

The stability range of the normalized step, \(\tilde{\mu} \in (0, 2)\), no longer depends on the input power. The small constant \(\varepsilon\) guards against division by tiny norms.
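
One way to see where the \((0, 2)\) range comes from (a standard argument, ignoring \(\varepsilon\)): substituting (10) into the a posteriori error \(d(n) - \mathbf{w}(n+1)^\top \mathbf{x}(n)\) gives

\[ d(n) - \mathbf{w}(n+1)^\top \mathbf{x}(n) = (1 - \tilde{\mu})\, e(n) \]

so each update shrinks the error on the current sample exactly when \(|1 - \tilde{\mu}| < 1\), i.e. for \(0 < \tilde{\mu} < 2\).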

LMS vs NLMS

| Feature | LMS | NLMS |
| --- | --- | --- |
| Cost per sample | \(O(L)\) | \(O(L)\) |
| Step-size bound | \(\mu < 2/(L P_x)\) (input-dependent) | \(\tilde{\mu} < 2\) (input-independent) |
| Robustness to input-level swings | Weak | Strong |
| Implementation effort | Minimal | LMS plus a normalization term |

Python Implementation

System Identification Example

We identify an unknown FIR system \(\mathbf{w}^\ast\) driven by white noise, using both LMS and NLMS, and plot the learning curves (squared error over iterations).

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# --- Unknown system to identify ---
L = 16
w_true = np.array([0.5, -0.3, 0.2, 0.1, -0.05, 0.02,
                   0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.0])

# --- Input and observation ---
N = 5000
x = np.random.randn(N)
v = 0.05 * np.random.randn(N)        # measurement noise
d = np.convolve(x, w_true, mode='full')[:N] + v


def lms(x, d, L, mu):
    """Standard LMS algorithm, Eq. (7)."""
    N = len(x)
    w = np.zeros(L)
    e = np.zeros(N)
    for n in range(L, N):              # start at L so the slice below is full length
        x_vec = x[n:n - L:-1]          # [x(n), x(n-1), ..., x(n-L+1)], Eq. (1)
        y = w @ x_vec                  # filter output, Eq. (3)
        e[n] = d[n] - y                # a priori error
        w = w + mu * e[n] * x_vec      # coefficient update, Eq. (7)
    return w, e


def nlms(x, d, L, mu_tilde, eps=1e-6):
    """Normalized LMS algorithm, Eq. (10)."""
    N = len(x)
    w = np.zeros(L)
    e = np.zeros(N)
    for n in range(L, N):
        x_vec = x[n:n - L:-1]          # [x(n), x(n-1), ..., x(n-L+1)]
        y = w @ x_vec
        e[n] = d[n] - y
        norm = x_vec @ x_vec + eps     # ||x(n)||^2 + eps guards tiny norms
        w = w + (mu_tilde / norm) * e[n] * x_vec
    return w, e


# --- Run ---
mu_lms = 0.01
mu_nlms = 0.5
w_lms, e_lms = lms(x, d, L, mu_lms)
w_nlms, e_nlms = nlms(x, d, L, mu_nlms)


# --- Learning curve (smoothed squared error) ---
def smooth(e, win=50):
    """Moving average of the squared error: a simple learning-curve estimate."""
    kernel = np.ones(win) / win
    return np.convolve(e ** 2, kernel, mode='same')


plt.figure(figsize=(10, 5))
plt.semilogy(smooth(e_lms), label=f'LMS (μ={mu_lms})')
plt.semilogy(smooth(e_nlms), label=f'NLMS (μ̃={mu_nlms})')
plt.xlabel('Iteration n')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curves: LMS vs NLMS')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"LMS   final coeff error: {np.linalg.norm(w_lms - w_true):.4f}")
print(f"NLMS  final coeff error: {np.linalg.norm(w_nlms - w_true):.4f}")

The log-scale curves show an exponential decay phase followed by a flat steady-state misadjustment floor. Increasing \(\mu\) (or \(\tilde{\mu}\)) speeds up the initial decay but raises the floor: the canonical convergence-vs-misadjustment trade-off.
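
A quick sanity check (run after the script above): the measurement-noise power \(\sigma_v^2 = 0.05^2 = 0.0025\) lower-bounds the achievable MSE, and the tail of the error should sit slightly above it.

# Steady-state MSE over the last 1000 samples vs. the noise floor
sigma_v2 = 0.05 ** 2
print(f"noise floor: {sigma_v2:.4f}")
print(f"LMS  steady-state MSE: {np.mean(e_lms[-1000:] ** 2):.4f}")
print(f"NLMS steady-state MSE: {np.mean(e_nlms[-1000:] ** 2):.4f}")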

Step-Size Sensitivity

plt.figure(figsize=(10, 5))
for mu in [0.005, 0.02, 0.05, 0.1]:
    _, e = lms(x, d, L, mu)
    plt.semilogy(smooth(e), label=f'μ={mu}', alpha=0.8)
plt.xlabel('Iteration n')
plt.ylabel('Mean Squared Error')
plt.title('LMS: Sensitivity to Step Size μ')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Tiny \(\mu\) converges painfully slowly; values near the upper bound oscillate from the start. As a rule of thumb, choose at most half of the practical bound in (9): here \(L = 16\) and \(P_x \approx 1\), so (9) gives \(\mu < 0.125\), and values around \(0.06\) or below are safe, consistent with the plot.

Connection to the Wiener Filter

LMS converges in mean to the Wiener solution \(\mathbf{w}^\ast = \mathbf{R}^{-1}\mathbf{p}\), with \(\mathbf{p} = E[d(n)\mathbf{x}(n)]\):

\[ E\bigl[\mathbf{w}(n)\bigr] \xrightarrow{n \to \infty} \mathbf{w}^\ast \tag{11} \]

Because LMS follows an instantaneous gradient rather than the true one, a finite excess MSE (misadjustment) remains above the Wiener minimum. When faster convergence is essential, RLS (Recursive Least Squares) delivers it at higher computational cost.
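
For the experiment above this is easy to verify with sample estimates of \(\mathbf{R}\) and \(\mathbf{p}\) (a sketch that reuses x, d, and the adapted weights from the identification script):

# Sample-based Wiener solution w* = R^{-1} p for the data above
r = np.array([np.mean(x[k:] * x[:N - k]) for k in range(L)])
R_hat = np.array([[r[abs(i - j)] for j in range(L)] for i in range(L)])
p_hat = np.array([np.mean(d[k:] * x[:N - k]) for k in range(L)])
w_wiener = np.linalg.solve(R_hat, p_hat)

print(f"Wiener vs true : {np.linalg.norm(w_wiener - w_true):.4f}")
print(f"LMS  vs Wiener : {np.linalg.norm(w_lms - w_wiener):.4f}")
print(f"NLMS vs Wiener : {np.linalg.norm(w_nlms - w_wiener):.4f}")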

When to Use What

| Use case | Choice | Reason |
| --- | --- | --- |
| Stationary input level | LMS | Smallest implementation, lowest cost |
| Speech / time-varying signals | NLMS | Robust to input-power swings |
| Fast convergence is critical | RLS | Quadratic cost \(O(L^2)\) but quick |
| Embedded / low-resource targets | LMS / NLMS | \(O(L)\) is sufficient |

References

  • Haykin, S. (2014). Adaptive Filter Theory (5th ed.). Pearson. Chapters 5-6.
  • Widrow, B., & Stearns, S. D. (1985). Adaptive Signal Processing. Prentice Hall.
  • Sayed, A. H. (2008). Adaptive Filters. Wiley-IEEE Press.