k-means and GMM: Clustering Theory and Python Implementation

Introduction

Clustering is a fundamental unsupervised learning task that partitions unlabeled data into groups. This article covers k-means, the most widely used method, and Gaussian Mixture Models (GMM), a probabilistic approach.

k-means Algorithm

Algorithm

Initialize \(k\) centroids \(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\) randomly
Assignment step: Assign each point \(\mathbf{x}_i\) to the nearest centroid

\[c_i = \arg\min_{j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \tag{1}\]

Update step: Recompute centroids

\[\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}_i \tag{2}\]

Repeat 2-3 until convergence

Objective Function

k-means minimizes the inertia:

\[J = \sum_{j=1}^{k} \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \tag{3}\]

\(J\) monotonically decreases at each step, guaranteeing convergence in finite iterations. However, global optimality is not guaranteed due to initialization dependence.

k-means++

Selects initial centroids with probability proportional to squared distance from existing centroids, mitigating the initialization problem. This is scikit-learn’s default.

Gaussian Mixture Model (GMM)

Model Definition

A mixture of \(k\) Gaussian distributions:

\[p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \tag{4}\]

where \(\pi_j\) are mixing coefficients (\(\sum_j \pi_j = 1\)) and \(\boldsymbol{\Sigma}_j\) are covariance matrices.

EM Algorithm

E-step: Compute responsibilities

\[\gamma_{ij} = \frac{\pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)} \tag{5}\]

M-step: Update parameters

\[\boldsymbol{\mu}_j = \frac{\sum_i \gamma_{ij} \mathbf{x}_i}{\sum_i \gamma_{ij}}, \quad \boldsymbol{\Sigma}_j = \frac{\sum_i \gamma_{ij} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^T}{\sum_i \gamma_{ij}} \tag{6}\]\[\pi_j = \frac{\sum_i \gamma_{ij}}{n} \tag{7}\]

k-means vs GMM

Feature	k-means	GMM
Assignment	Hard (0 or 1)	Soft (probability)
Cluster shape	Spherical	Ellipsoidal (arbitrary covariance)
Objective	Inertia	Log-likelihood
Computation	Fast	Slower

Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, y_true = make_blobs(n_samples=500, centers=[[-2, -2], [2, 2], [0, 4]],
                        cluster_std=[0.8, 1.2, 0.6], random_state=42)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# --- k-means ---
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
km_labels = kmeans.fit_predict(X)

# --- GMM ---
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
gmm_labels = gmm.predict(X)
gmm_probs = gmm.predict_proba(X)

# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis', s=15, alpha=0.6)
axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c='red', marker='X', s=200, edgecolors='k')
axes[0].set_title('k-means')

axes[1].scatter(X[:, 0], X[:, 1], c=gmm_labels, cmap='viridis', s=15, alpha=0.6)
axes[1].set_title('GMM (hard)')

uncertainty = 1 - gmm_probs.max(axis=1)
axes[2].scatter(X[:, 0], X[:, 1], c=gmm_labels, cmap='viridis',
                s=15, alpha=1 - uncertainty * 0.8)
axes[2].set_title('GMM (soft: uncertainty)')

for ax in axes:
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Cluster Number Selection

Elbow Method

inertias = []
K_range = range(1, 10)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True, alpha=0.3)
plt.show()

Silhouette Score

\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \tag{8}\]

where \(a(i)\) is the mean intra-cluster distance and \(b(i)\) is the mean nearest-cluster distance. \(s(i) \in [-1, 1]\); closer to 1 indicates better clustering.

BIC (for GMM)

bics = []
K_range = range(1, 10)
for k in K_range:
    gm = GaussianMixture(n_components=k, random_state=42)
    gm.fit(X)
    bics.append(gm.bic(X))

plt.plot(K_range, bics, 'ro-')
plt.xlabel('k')
plt.ylabel('BIC')
plt.title('BIC for GMM')
plt.grid(True, alpha=0.3)
plt.show()

Select \(k\) that minimizes BIC.

Ensemble Learning: From Decision Trees to Random Forest and Gradient Boosting - Supervised learning methods; clustering results can serve as features.
Understanding Self-Attention from Scratch: Math and Python Implementation - Similarity between Attention’s weighted averaging and GMM’s mixture model.
Support Vector Machines (SVM): Kernel Methods and Nonlinear Classification with Python - Comparison with supervised classification; SVM finds boundaries while clustering discovers data structure.
Markov Chain Monte Carlo (MCMC): Metropolis-Hastings and Gibbs Sampling - MCMC can also estimate GMM parameters.
Bayesian Linear Regression: From Least Squares to Bayesian Estimation - Bayesian foundations underlying GMM.
From SGD to Adam: Evolution of Gradient-Based Optimization - Understanding the relationship between EM and gradient methods.

References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 9.
MacQueen, J. (1967). “Some methods for classification and analysis of multivariate observations”. Proceedings of the 5th Berkeley Symposium.
scikit-learn Clustering documentation