k-means and GMM: Clustering Theory and Python Implementation

From k-means convergence guarantees to GMM with EM algorithm, scikit-learn implementation, and cluster number selection (elbow, silhouette, BIC).

Introduction

Clustering is a fundamental unsupervised learning task that partitions unlabeled data into groups. This article covers k-means, the most widely used method, and Gaussian Mixture Models (GMM), a probabilistic approach.

k-means Algorithm

Algorithm

  1. Initialize \(k\) centroids \(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\) randomly
  2. Assignment step: Assign each point \(\mathbf{x}_i\) to the nearest centroid
\[c_i = \arg\min_{j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \tag{1}\]
  1. Update step: Recompute centroids
\[\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}_i \tag{2}\]
  1. Repeat 2-3 until convergence

Objective Function

k-means minimizes the inertia:

\[J = \sum_{j=1}^{k} \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \tag{3}\]

\(J\) monotonically decreases at each step, guaranteeing convergence in finite iterations. However, global optimality is not guaranteed due to initialization dependence.

k-means++

Selects initial centroids with probability proportional to squared distance from existing centroids, mitigating the initialization problem. This is scikit-learn’s default.

Gaussian Mixture Model (GMM)

Model Definition

A mixture of \(k\) Gaussian distributions:

\[p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \tag{4}\]

where \(\pi_j\) are mixing coefficients (\(\sum_j \pi_j = 1\)) and \(\boldsymbol{\Sigma}_j\) are covariance matrices.

EM Algorithm

E-step: Compute responsibilities

\[\gamma_{ij} = \frac{\pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)} \tag{5}\]

M-step: Update parameters

\[\boldsymbol{\mu}_j = \frac{\sum_i \gamma_{ij} \mathbf{x}_i}{\sum_i \gamma_{ij}}, \quad \boldsymbol{\Sigma}_j = \frac{\sum_i \gamma_{ij} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^T}{\sum_i \gamma_{ij}} \tag{6}\]\[\pi_j = \frac{\sum_i \gamma_{ij}}{n} \tag{7}\]

k-means vs GMM

Featurek-meansGMM
AssignmentHard (0 or 1)Soft (probability)
Cluster shapeSphericalEllipsoidal (arbitrary covariance)
ObjectiveInertiaLog-likelihood
ComputationFastSlower

Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, y_true = make_blobs(n_samples=500, centers=[[-2, -2], [2, 2], [0, 4]],
                        cluster_std=[0.8, 1.2, 0.6], random_state=42)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# --- k-means ---
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
km_labels = kmeans.fit_predict(X)

# --- GMM ---
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
gmm_labels = gmm.predict(X)
gmm_probs = gmm.predict_proba(X)

# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis', s=15, alpha=0.6)
axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c='red', marker='X', s=200, edgecolors='k')
axes[0].set_title('k-means')

axes[1].scatter(X[:, 0], X[:, 1], c=gmm_labels, cmap='viridis', s=15, alpha=0.6)
axes[1].set_title('GMM (hard)')

uncertainty = 1 - gmm_probs.max(axis=1)
axes[2].scatter(X[:, 0], X[:, 1], c=gmm_labels, cmap='viridis',
                s=15, alpha=1 - uncertainty * 0.8)
axes[2].set_title('GMM (soft: uncertainty)')

for ax in axes:
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Cluster Number Selection

Elbow Method

inertias = []
K_range = range(1, 10)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True, alpha=0.3)
plt.show()

Silhouette Score

\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \tag{8}\]

where \(a(i)\) is the mean intra-cluster distance and \(b(i)\) is the mean nearest-cluster distance. \(s(i) \in [-1, 1]\); closer to 1 indicates better clustering.

BIC (for GMM)

bics = []
K_range = range(1, 10)
for k in K_range:
    gm = GaussianMixture(n_components=k, random_state=42)
    gm.fit(X)
    bics.append(gm.bic(X))

plt.plot(K_range, bics, 'ro-')
plt.xlabel('k')
plt.ylabel('BIC')
plt.title('BIC for GMM')
plt.grid(True, alpha=0.3)
plt.show()

Select \(k\) that minimizes BIC.

References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 9.
  • MacQueen, J. (1967). “Some methods for classification and analysis of multivariate observations”. Proceedings of the 5th Berkeley Symposium.
  • scikit-learn Clustering documentation