K-means is a type of non-hierarchical clustering. Non-hierarchical clustering partitions given data into a predetermined number of clusters $K$, seeking a grouping where each cluster contains elements with similar properties. Unlike hierarchical clustering, it does not build a hierarchy of clusters, and its low computational cost makes it well suited to large datasets.
K-means measures the distance between data points and cluster centers (centroids), and partitions the data into $K$ clusters so that these distances are minimized. The centroid $c_j$ of each cluster $C_j$ is computed as the mean of the data points belonging to that cluster.
$$ c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x $$
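As a concrete illustration of this formula, here is a minimal NumPy sketch that computes the centroid of one hypothetical cluster; the points in `C_j` are made-up example data, not from the text.

```python
import numpy as np

# A hypothetical cluster C_j of four 2-D points (illustrative data only).
C_j = np.array([[1.0, 2.0],
                [2.0, 2.0],
                [1.0, 3.0],
                [2.0, 3.0]])

# Centroid c_j = (1/|C_j|) * sum of the points, i.e. the coordinate-wise mean.
c_j = C_j.mean(axis=0)
print(c_j)  # [1.5 2.5]
```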
K-means performs clustering by minimizing the following objective function. This function represents the sum of squared distances from each data point to the centroid of the cluster it belongs to, and takes smaller values when within-cluster cohesion is high.
$$ J = \sum_{j=1}^K \sum_{x \in C_j} ||x - c_j||^2 $$
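The objective can be evaluated directly from a set of points and their cluster assignments. The sketch below computes $J$ for a small made-up dataset with $K = 2$ and hypothetical labels; both the data and the assignment are illustrative assumptions.

```python
import numpy as np

# Illustrative data: four points with hypothetical cluster labels (K = 2).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
K = 2

# J: sum of squared distances from each point to its own cluster's centroid.
J = 0.0
for j in range(K):
    members = X[labels == j]        # points assigned to cluster C_j
    c_j = members.mean(axis=0)      # centroid of C_j
    J += ((members - c_j) ** 2).sum()
print(J)  # 4.0
```

A tighter clustering of the same points would yield a smaller $J$, which is exactly what the algorithm below drives toward.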
The K-means Algorithm
1. Initialization: Randomly select $K$ points from the data as initial cluster centroids.
2. Assignment step (E-step): For each data point, find the nearest cluster centroid and assign the data point to that cluster.
3. Update step (M-step): Compute the mean of all data points assigned to each cluster and set it as the new cluster centroid.
4. Convergence check: Repeat steps 2 and 3 until the centroids no longer move or a predetermined maximum number of iterations is reached.
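The steps above can be sketched in plain NumPy. This is a minimal from-scratch implementation for illustration, not a production routine; the function name `kmeans` and the synthetic two-blob dataset in the usage example are assumptions of this sketch.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means sketch on data X of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update step: move each centroid to the mean of its members
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        # 4. Convergence check: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage on two well-separated synthetic blobs (illustrative data):
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels, centroids = kmeans(X, K=2)
```

On data this well separated, the two blobs end up in different clusters regardless of which points are drawn as initial centroids.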
K-means is simple and fast, which makes it widely used. However, it has several known limitations: results depend on the choice of initial centroids, it assumes spherical cluster shapes, and the number of clusters $K$ must be specified in advance.