Short-Time Fourier Transform (STFT): Theory and Python Implementation

A systematic introduction to the Short-Time Fourier Transform (STFT): mathematical foundations, spectrogram generation and interpretation, the time-frequency resolution trade-off, and Python implementation.

Introduction

The Fast Fourier Transform (FFT) is a powerful technique for analyzing the frequency content of an entire signal at once, but it suffers from a fundamental limitation: time information is lost. When we need to track how a melody evolves over time, or how a machine’s vibration pattern changes, we need to know which frequencies are present at which times.

The Short-Time Fourier Transform (STFT) has been the standard approach to this problem since the 1940s. By dividing a signal into short time windows and applying the FFT to each window, it produces a two-dimensional time-frequency representation known as a spectrogram.

This article provides a detailed treatment of the mathematical definition of STFT and the time-frequency resolution trade-off, followed by Python implementations and guidance on reading spectrograms.

Mathematical Definition of STFT

Continuous-Time STFT

The Short-Time Fourier Transform of a continuous signal \(x(t)\) is defined as:

\[\text{STFT}\{x\}(\tau, f) = X(\tau, f) = \int_{-\infty}^{\infty} x(t) \cdot w(t - \tau) \cdot e^{-j2\pi ft} \, dt \tag{1}\]

where:

  • \(w(t)\) : window function (a finite-duration localization function)
  • \(\tau\) : center time of the window (time-shift parameter)
  • \(f\) : frequency

\(w(t - \tau)\) is the window translated to be centered at time \(\tau\) , which extracts a local segment of the signal for Fourier analysis. The result \(X(\tau, f)\) is complex-valued: its magnitude \(|X(\tau, f)|\) represents the amplitude of frequency component \(f\) at time \(\tau\) , while the argument \(\angle X(\tau, f)\) represents the phase.

Discrete-Time STFT

In practice we work with discrete-time signals. The discrete STFT of a signal \(x[n]\) (\(n = 0, 1, \ldots, N-1\) ) sampled at frequency \(f_s\) is defined as:

\[X[m, k] = \sum_{n=-\infty}^{\infty} x[n] \cdot w[n - mH] \cdot e^{-j2\pi kn/L} \tag{2}\]

where:

  • \(m\) : frame index (time axis)
  • \(k\) : frequency bin index (\(k = 0, 1, \ldots, L-1\) )
  • \(H\) : hop length (frame shift) — the number of samples between consecutive frame start positions
  • \(L\) : window length (FFT size)
  • \(w[n]\) : window function of length \(L\)

Each frame is the length-\(L\) segment of the signal starting at sample \(mH\) (where the translated window \(w[n - mH]\) is nonzero), multiplied by the window; the FFT is then applied to the windowed segment.

Overlap and Hop Length

The overlap between consecutive frames is defined as:

\[\text{Overlap} = 1 - \frac{H}{L} \tag{3}\]

For example, with \(L = 1024\) and \(H = 512\) , the overlap is 50%. Higher overlap improves time resolution but increases computational cost. In practice, overlaps of 50%–75% are most common.
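As a quick sanity check (a minimal sketch; the variable names mirror the notation above), the relationship between hop length, overlap, and frame count can be computed directly:

```python
L = 1024   # window length (FFT size)
H = 512    # hop length

overlap = 1 - H / L
print(f"Overlap: {overlap:.0%}")   # → Overlap: 50%

# Number of full frames that fit in a signal of N samples
N = 44100
n_frames = 1 + (N - L) // H
print(f"Frames: {n_frames}")       # → Frames: 85
```

Halving \(H\) (75% overlap) roughly doubles the number of frames, and hence the computational cost.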

The Time-Frequency Resolution Trade-Off

The most important constraint of STFT is the time-frequency resolution trade-off, which is mathematically analogous to the Heisenberg uncertainty principle in quantum mechanics.

The Gabor Limit

There is a lower bound on the product of time resolution \(\Delta t\) and frequency resolution \(\Delta f\) :

\[\Delta t \cdot \Delta f \geq \frac{1}{4\pi} \tag{4}\]

This is known as the uncertainty principle in signal processing.
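The bound in Eq. (4) is attained by a Gaussian window. This can be verified numerically (a sketch: the sampling rate, time span, and \(\sigma\) are arbitrary choices; \(\Delta t\) and \(\Delta f\) are computed as standard deviations of the normalized energy densities \(|g(t)|^2\) and \(|G(f)|^2\)):

```python
import numpy as np

fs = 1000.0
t = np.arange(-5.0, 5.0, 1 / fs)
sigma = 0.1
g = np.exp(-t**2 / (2 * sigma**2))   # Gaussian window

# Time spread: standard deviation of the energy density |g(t)|^2
p_t = g**2 / np.sum(g**2)
delta_t = np.sqrt(np.sum(t**2 * p_t))

# Frequency spread: standard deviation of |G(f)|^2
G = np.fft.fftshift(np.fft.fft(g))
f = np.fft.fftshift(np.fft.fftfreq(len(t), 1 / fs))
p_f = np.abs(G)**2 / np.sum(np.abs(G)**2)
delta_f = np.sqrt(np.sum(f**2 * p_f))

print(delta_t * delta_f, 1 / (4 * np.pi))  # both ≈ 0.0796
```

Any non-Gaussian window (Hann, Hamming, rectangular) yields a strictly larger product.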

Window Length and the Trade-Off

The window length \(L\) (or \(T_w = L / f_s\) in seconds) determines the resolution characteristics:

| Window length \(T_w\) | Time resolution \(\Delta t\) | Frequency resolution \(\Delta f\) |
|---|---|---|
| Short (small \(L\)) | High | Low (adjacent frequencies hard to separate) |
| Long (large \(L\)) | Low | High (fine frequency separation possible) |

As a concrete example, for an audio signal at \(f_s = 44100\) Hz:

  • \(L = 512\) (~12 ms): sensitive to temporal changes, but frequency resolution is ~86 Hz
  • \(L = 4096\) (~93 ms): frequency resolution is ~11 Hz, but slow to track temporal changes

The frequency resolution is given by \(\Delta f = f_s / L\) .
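The numbers quoted above follow directly from these definitions (a minimal check):

```python
fs = 44100  # sampling frequency [Hz]

for L in (512, 4096):
    T_w = L / fs * 1000   # window duration [ms]
    df = fs / L           # frequency resolution [Hz]
    print(f"L={L}: T_w = {T_w:.1f} ms, df = {df:.1f} Hz")
# → L=512: T_w = 11.6 ms, df = 86.1 Hz
# → L=4096: T_w = 92.9 ms, df = 10.8 Hz
```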

Python Implementation

Basic STFT Implementation from Scratch

We first implement STFT manually to understand its mechanics.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal.windows import hann

def compute_stft_manual(x, window, hop_length, n_fft):
    """
    Manual implementation of STFT.

    Parameters
    ----------
    x : np.ndarray
        Input signal (1D)
    window : np.ndarray
        Window function (length n_fft)
    hop_length : int
        Frame shift (in samples)
    n_fft : int
        FFT size (= window length)

    Returns
    -------
    stft_matrix : np.ndarray (complex)
        Complex spectrum matrix of shape (n_fft//2 + 1, n_frames)
    """
    n_frames = 1 + (len(x) - n_fft) // hop_length
    stft_matrix = np.zeros((n_fft // 2 + 1, n_frames), dtype=complex)

    for m in range(n_frames):
        # Extract frame
        start = m * hop_length
        frame = x[start : start + n_fft]
        # Apply window function
        windowed_frame = frame * window
        # FFT (positive frequencies only)
        stft_matrix[:, m] = np.fft.rfft(windowed_frame)

    return stft_matrix


# --- Generate test signal ---
fs = 4000          # Sampling frequency [Hz]
duration = 2.0     # Signal duration [s]
t = np.arange(0, duration, 1/fs)

# Time-varying signal (chirp + steady tone)
# 0–1 s: chirp linearly sweeping from 100 Hz to 500 Hz
# 1–2 s: steady 300 Hz sinusoid
chirp = np.zeros_like(t)
segment1 = t < 1.0
chirp[segment1] = np.sin(2 * np.pi * (100 + 200 * t[segment1]) * t[segment1])
chirp[~segment1] = np.sin(2 * np.pi * 300 * t[~segment1])
# Add white noise
np.random.seed(42)
signal = chirp + 0.1 * np.random.randn(len(t))

# --- STFT parameters ---
n_fft = 512
hop_length = n_fft // 4   # 75% overlap
window = hann(n_fft)

# --- Compute STFT manually ---
stft_matrix = compute_stft_manual(signal, window, hop_length, n_fft)

# Compute spectrogram (magnitude)
spectrogram = np.abs(stft_matrix)

# Compute time and frequency axes
n_frames = spectrogram.shape[1]
times = np.arange(n_frames) * hop_length / fs
freqs = np.fft.rfftfreq(n_fft, 1/fs)

# --- Plot ---
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Original signal
axes[0].plot(t, signal, linewidth=0.5)
axes[0].set_xlabel('Time [s]')
axes[0].set_ylabel('Amplitude')
axes[0].set_title('Input Signal: Chirp (100→500 Hz) + 300 Hz')
axes[0].grid(True, alpha=0.3)

# Spectrogram
img = axes[1].pcolormesh(
    times, freqs, 20 * np.log10(spectrogram + 1e-10),
    shading='gouraud', cmap='inferno', vmin=-60, vmax=0
)
axes[1].set_xlabel('Time [s]')
axes[1].set_ylabel('Frequency [Hz]')
axes[1].set_title(f'Spectrogram (n_fft={n_fft}, hop={hop_length})')
axes[1].set_ylim(0, 800)
plt.colorbar(img, ax=axes[1], label='Magnitude [dB]')

plt.tight_layout()
plt.show()

Efficient Implementation with scipy.signal.stft

For real projects, using scipy.signal.stft is recommended.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft, istft
from scipy.signal.windows import hann

# --- Test signal (simulating speech-like signal) ---
fs = 8000
t = np.arange(0, 3.0, 1/fs)

# Construct a time-varying signal:
# 0–1 s: 200 Hz
# 1–2 s: 200 Hz + 500 Hz
# 2–3 s: 800 Hz
x = np.zeros_like(t)
mask1 = t < 1.0
mask2 = (t >= 1.0) & (t < 2.0)
mask3 = t >= 2.0
x[mask1] = np.sin(2 * np.pi * 200 * t[mask1])
x[mask2] = (np.sin(2 * np.pi * 200 * t[mask2])
            + 0.7 * np.sin(2 * np.pi * 500 * t[mask2]))
x[mask3] = np.sin(2 * np.pi * 800 * t[mask3])
x += 0.05 * np.random.randn(len(t))

# --- Compare STFTs with different window sizes ---
n_fft_list = [128, 512, 2048]
fig, axes = plt.subplots(len(n_fft_list), 1, figsize=(12, 10))

for ax, n_fft in zip(axes, n_fft_list):
    hop = n_fft // 4
    f, t_stft, Zxx = stft(x, fs=fs, window='hann',
                           nperseg=n_fft, noverlap=n_fft - hop)
    magnitude_dB = 20 * np.log10(np.abs(Zxx) + 1e-10)

    img = ax.pcolormesh(t_stft, f, magnitude_dB,
                        shading='gouraud', cmap='inferno',
                        vmin=-60, vmax=0)
    ax.set_ylabel('Frequency [Hz]')
    ax.set_ylim(0, 1200)
    df = fs / n_fft
    dt_ms = hop / fs * 1000
    ax.set_title(
        f'n_fft={n_fft}: Δf={df:.1f} Hz, Δt={dt_ms:.1f} ms '
        f'(time-freq tradeoff)'
    )
    plt.colorbar(img, ax=ax, label='dB')

axes[-1].set_xlabel('Time [s]')
plt.tight_layout()
plt.show()

From these plots, we can observe that n_fft = 128 renders temporal transitions sharply but provides coarse frequency separation, while n_fft = 2048 achieves fine frequency resolution at the cost of temporal blurring at segment boundaries.

Signal Reconstruction via Inverse STFT

The STFT can be inverted via the Inverse STFT (ISTFT) to recover the original signal. With appropriate window function and hop length choices, perfect reconstruction is achievable:

\[x[n] = \frac{\sum_{m} \tilde{x}_m[n] \cdot w[n - mH]}{\sum_{m} w^2[n - mH]}, \qquad \tilde{x}_m[n] = \frac{1}{L} \sum_{k=0}^{L-1} X[m, k] \, e^{j2\pi kn/L} \tag{5}\]

where \(\tilde{x}_m[n]\) is the inverse FFT of frame \(m\); the frames are combined by weighted overlap-add.

from scipy.signal import stft, istft

# Compute STFT
n_fft = 512
hop = n_fft // 4
f, t_stft, Zxx = stft(x, fs=fs, window='hann',
                       nperseg=n_fft, noverlap=n_fft - hop)

# Reconstruct signal via inverse STFT
_, x_reconstructed = istft(Zxx, fs=fs, window='hann',
                            nperseg=n_fft, noverlap=n_fft - hop)

# Check reconstruction error
min_len = min(len(x), len(x_reconstructed))
error = np.max(np.abs(x[:min_len] - x_reconstructed[:min_len]))
print(f"Maximum reconstruction error: {error:.2e}")
# → approximately 1e-14 (near-perfect reconstruction)

Reading a Spectrogram

Here we summarize the key points for interpreting spectrograms.

Meaning of Axes and Color

| Axis / Element | Meaning | Unit |
|---|---|---|
| Horizontal axis | Time (center time of each frame) | s |
| Vertical axis | Frequency | Hz |
| Color (intensity) | Amplitude or log-scale power | dB |

Common color maps include inferno (dark to bright: low to high power), viridis, and jet.

Importance of the dB Scale

Displaying amplitude on a linear scale causes strong components to dominate and weak components to become invisible. Using the logarithmic (dB) scale covers a wide dynamic range:

\[\text{Magnitude [dB]} = 20 \log_{10}(|X[m, k]| + \epsilon) \tag{6}\]

where \(\epsilon\) is a small constant for numerical stability (e.g., \(10^{-10}\) ).
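For instance (a minimal illustration of why the log scale matters), two components whose amplitudes differ by a factor of 1000 are 60 dB apart, which a linear color scale would render almost invisibly:

```python
import numpy as np

# Two components whose amplitudes differ by a factor of 1000
mags = np.array([1.0, 0.001])
eps = 1e-10                      # numerical-stability constant from Eq. (6)
db = 20 * np.log10(mags + eps)
print(np.round(db))              # → [  0. -60.]
```

On a linear scale the second component occupies 0.1% of the color range; in dB it sits comfortably within a typical −60 to 0 dB display window.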

Identifying Harmonics

In spectrograms of speech or musical instruments, the fundamental frequency \(f_0\) and its integer multiples — the harmonics (\(2f_0, 3f_0, \ldots\) ) — appear as horizontal stripes. This structure can be leveraged for pitch estimation.
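As an illustration of harmonic-based pitch estimation (a sketch using a synthetic tone; the 10% peak-picking threshold and the use of the median peak spacing are simplistic assumptions, not a production pitch tracker):

```python
import numpy as np
from scipy.signal import find_peaks

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
f0 = 200  # fundamental frequency [Hz]

# Harmonic signal: f0 plus harmonics with 1/k amplitudes
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 5))

# Magnitude spectrum of one long Hann-windowed frame
spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

# Pick prominent spectral peaks; they appear at k * f0
peaks, _ = find_peaks(spec, height=0.1 * spec.max())
peak_freqs = freqs[peaks]

# Estimate f0 as the median spacing between adjacent harmonic peaks
f0_est = np.median(np.diff(peak_freqs))
print(f"Estimated f0: {f0_est:.1f} Hz")  # → Estimated f0: 200.0 Hz
```

Applying the same peak-spacing idea frame by frame across a spectrogram yields a rudimentary pitch track.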

Practical Application: Audio Signal Analysis

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft

# --- Create a synthetic voiced sound ---
fs = 16000
duration = 1.5
t = np.arange(0, duration, 1/fs)

# Simulate the vowel "a": F0=120 Hz, formants F1=800 Hz, F2=1200 Hz, F3=2500 Hz
np.random.seed(0)
glottal = np.zeros_like(t)
T0 = int(fs / 120)  # Fundamental period
for i in range(0, len(t), T0):
    if i + 20 < len(t):
        glottal[i:i+20] = np.exp(-np.arange(20) * 0.3)  # Approximate glottal pulse

# Formant filter (simplified)
from scipy.signal import lfilter, butter

def formant_filter(signal, center_freq, bandwidth, fs):
    """Simulate a formant using a simple bandpass filter"""
    b, a = butter(2, [center_freq - bandwidth/2, center_freq + bandwidth/2],
                  btype='band', fs=fs)
    return lfilter(b, a, signal)

voiced = (formant_filter(glottal, 800, 200, fs)
          + 0.7 * formant_filter(glottal, 1200, 150, fs)
          + 0.4 * formant_filter(glottal, 2500, 200, fs))
voiced = voiced / np.max(np.abs(voiced) + 1e-8)

# --- STFT and spectrogram ---
n_fft = 512
hop = n_fft // 8  # 87.5% overlap (common in speech processing)

f, t_stft, Zxx = stft(voiced, fs=fs, window='hann',
                       nperseg=n_fft, noverlap=n_fft - hop)

fig, axes = plt.subplots(2, 1, figsize=(12, 8))

axes[0].plot(t, voiced, linewidth=0.5)
axes[0].set_xlabel('Time [s]')
axes[0].set_ylabel('Amplitude')
axes[0].set_title('Synthetic Voiced Sound (vowel "a" approximation)')
axes[0].grid(True, alpha=0.3)

img = axes[1].pcolormesh(
    t_stft, f,
    20 * np.log10(np.abs(Zxx) + 1e-10),
    shading='gouraud', cmap='inferno', vmin=-60, vmax=0
)
axes[1].set_xlabel('Time [s]')
axes[1].set_ylabel('Frequency [Hz]')
axes[1].set_ylim(0, 4000)
axes[1].set_title('Spectrogram: Formants visible at ~800 Hz, 1200 Hz, 2500 Hz')
plt.colorbar(img, ax=axes[1], label='dB')

plt.tight_layout()
plt.show()

The resulting spectrogram shows horizontal bands (formants) near 800 Hz, 1200 Hz, and 2500 Hz, along with the harmonic series at integer multiples of the 120 Hz fundamental frequency.

Limitations of STFT and Wavelet Transform

Because the STFT uses a fixed window length, the time-frequency resolution trade-off is set once for all frequencies. A window long enough to separate closely spaced low-frequency components inevitably smears rapid high-frequency transients in time, while a window short enough to localize those transients cannot resolve the low frequencies.

The Wavelet Transform addresses this limitation through frequency-dependent variable window lengths: long windows at low frequencies (high frequency resolution) and short windows at high frequencies (high time resolution), yielding a more efficient time-frequency representation.
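To make the contrast concrete, here is a minimal (and deliberately naive) Morlet-wavelet transform sketch; the direct-convolution approach, the w0 = 6 choice, and the normalization are illustrative assumptions, not an optimized CWT implementation:

```python
import numpy as np

def morlet_cwt(x, fs, freqs, w0=6.0):
    """Naive continuous wavelet transform via direct convolution.

    The analysis window shrinks as frequency grows: scale s = w0 / (2*pi*f),
    so high frequencies get short windows (good time resolution) and low
    frequencies get long windows (good frequency resolution).
    """
    n = len(x)
    t = (np.arange(n) - n // 2) / fs
    out = np.zeros((len(freqs), n), dtype=complex)
    for i, f in enumerate(freqs):
        s = w0 / (2 * np.pi * f)                       # scale for this frequency
        psi = np.exp(1j * w0 * t / s) * np.exp(-t**2 / (2 * s**2))
        psi /= np.sqrt(s)                              # energy normalization
        out[i] = np.convolve(x, np.conj(psi[::-1]), mode='same')
    return out

# Usage: a 50 Hz tone produces the strongest response in the 50 Hz row
fs = 1000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 50 * t)
freqs = np.array([25.0, 50.0, 100.0])
W = morlet_cwt(x, fs, freqs)
print(np.argmax(np.mean(np.abs(W), axis=1)))  # → 1 (the 50 Hz row)
```

In practice one would use a dedicated library (e.g., PyWavelets) rather than this O(n²) loop.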

| Method | Time resolution | Frequency resolution | Window | Primary applications |
|---|---|---|---|---|
| FFT | None (global) | High (fixed) | Entire signal | Stationary signal spectral analysis |
| STFT | Moderate (window-dependent) | Moderate | Fixed | Speech and music analysis |
| Wavelet Transform | Adaptive | Adaptive | Frequency-dependent | Transient phenomena, scale analysis |

Practical Parameter Selection Guide

| Application | Recommended n_fft | Recommended overlap | Rationale |
|---|---|---|---|
| Speech recognition (MFCC preprocessing) | 512–1024 | 50–75% | Balance between formant resolution and temporal tracking |
| Music analysis (pitch detection) | 2048–4096 | 75% | Requires ~10 Hz pitch resolution |
| Machinery vibration diagnostics | 1024–4096 | 50% | Sufficient frequency resolution to separate harmonics |
| Real-time processing | 256–512 | 50% | Low latency is the priority |
| Anomaly detection (impact events) | 128–256 | 75% | High time resolution for transient detection |

Summary

  • STFT produces a two-dimensional spectral representation (spectrogram) carrying both time and frequency information, by dividing a signal into short windows and applying the FFT to each
  • The time-frequency resolution trade-off (uncertainty principle) is a fundamental constraint: shortening the window improves time resolution at the expense of frequency resolution
  • Window functions are essential for controlling spectral leakage; the Hann window is the standard choice in speech processing
  • The Inverse STFT (ISTFT) can recover the original signal with near-perfect fidelity under appropriate parameter choices
  • Displaying the spectrogram in dB scale covers a wide dynamic range and enables visual identification of harmonics, formants, and frequency changes
  • When simultaneous high time and frequency resolution is required, the Wavelet Transform should be considered as an alternative

References

  • Allen, J. B. (1977). “Short term spectral analysis, synthesis, and modification by discrete Fourier transform.” IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(3), 235-238.
  • Griffin, D., & Lim, J. (1984). “Signal estimation from modified short-time Fourier transform.” IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243.
  • Oppenheim, A. V., & Schafer, R. W. (2009). Discrete-Time Signal Processing (3rd ed.). Prentice Hall.
  • SciPy Signal Processing — stft
  • Librosa: Audio and Music Analysis