Introduction
Clustering is a fundamental unsupervised learning task that partitions unlabeled data into groups. This article covers k-means, the most widely used method, and Gaussian Mixture Models (GMM), a probabilistic approach.
k-means Algorithm
Algorithm
- Initialize \(k\) centroids \(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\) randomly
- Assignment step: Assign each point \(\mathbf{x}_i\) to the nearest centroid
- Update step: Recompute centroids
- Repeat 2-3 until convergence
Objective Function
k-means minimizes the inertia:
\[J = \sum_{j=1}^{k} \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \tag{3}\]\(J\) monotonically decreases at each step, guaranteeing convergence in finite iterations. However, global optimality is not guaranteed due to initialization dependence.
k-means++
Selects initial centroids with probability proportional to squared distance from existing centroids, mitigating the initialization problem. This is scikit-learn’s default.
Gaussian Mixture Model (GMM)
Model Definition
A mixture of \(k\) Gaussian distributions:
\[p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \tag{4}\]where \(\pi_j\) are mixing coefficients (\(\sum_j \pi_j = 1\)) and \(\boldsymbol{\Sigma}_j\) are covariance matrices.
EM Algorithm
E-step: Compute responsibilities
\[\gamma_{ij} = \frac{\pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)} \tag{5}\]M-step: Update parameters
\[\boldsymbol{\mu}_j = \frac{\sum_i \gamma_{ij} \mathbf{x}_i}{\sum_i \gamma_{ij}}, \quad \boldsymbol{\Sigma}_j = \frac{\sum_i \gamma_{ij} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^T}{\sum_i \gamma_{ij}} \tag{6}\]\[\pi_j = \frac{\sum_i \gamma_{ij}}{n} \tag{7}\]k-means vs GMM
| Feature | k-means | GMM |
|---|---|---|
| Assignment | Hard (0 or 1) | Soft (probability) |
| Cluster shape | Spherical | Ellipsoidal (arbitrary covariance) |
| Objective | Inertia | Log-likelihood |
| Computation | Fast | Slower |
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
X, y_true = make_blobs(n_samples=500, centers=[[-2, -2], [2, 2], [0, 4]],
cluster_std=[0.8, 1.2, 0.6], random_state=42)
scaler = StandardScaler()
X = scaler.fit_transform(X)
# --- k-means ---
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
km_labels = kmeans.fit_predict(X)
# --- GMM ---
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
gmm_labels = gmm.predict(X)
gmm_probs = gmm.predict_proba(X)
# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis', s=15, alpha=0.6)
axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
c='red', marker='X', s=200, edgecolors='k')
axes[0].set_title('k-means')
axes[1].scatter(X[:, 0], X[:, 1], c=gmm_labels, cmap='viridis', s=15, alpha=0.6)
axes[1].set_title('GMM (hard)')
uncertainty = 1 - gmm_probs.max(axis=1)
axes[2].scatter(X[:, 0], X[:, 1], c=gmm_labels, cmap='viridis',
s=15, alpha=1 - uncertainty * 0.8)
axes[2].set_title('GMM (soft: uncertainty)')
for ax in axes:
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Cluster Number Selection
Elbow Method
inertias = []
K_range = range(1, 10)
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X)
inertias.append(km.inertia_)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True, alpha=0.3)
plt.show()
Silhouette Score
\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \tag{8}\]where \(a(i)\) is the mean intra-cluster distance and \(b(i)\) is the mean nearest-cluster distance. \(s(i) \in [-1, 1]\); closer to 1 indicates better clustering.
BIC (for GMM)
bics = []
K_range = range(1, 10)
for k in K_range:
gm = GaussianMixture(n_components=k, random_state=42)
gm.fit(X)
bics.append(gm.bic(X))
plt.plot(K_range, bics, 'ro-')
plt.xlabel('k')
plt.ylabel('BIC')
plt.title('BIC for GMM')
plt.grid(True, alpha=0.3)
plt.show()
Select \(k\) that minimizes BIC.
Related Articles
- Support Vector Machines (SVM) - Comparison with supervised classification; SVM finds boundaries while clustering discovers data structure.
- Bayesian Linear Regression Fundamentals - Bayesian foundations underlying GMM.
- Markov Chain Monte Carlo (MCMC) - MCMC can also estimate GMM parameters.
- SGD to Adam: Gradient Descent Theory and Implementation - Understanding the relationship between EM and gradient methods.
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 9.
- MacQueen, J. (1967). “Some methods for classification and analysis of multivariate observations”. Proceedings of the 5th Berkeley Symposium.
- scikit-learn Clustering documentation