K-means fundamentals

Given $n$ points $\bx_1,\dots,\bx_n \in \real^d$, the goal of K-means is find $K$ centers $\bc_1,\dots,\bc_m \in \real^d$ and assignments $q_1,\dots,q_n \in \{1,\dots,K\}$ of the points to the centers such that the sum of distances

$E(\bc_1,\dots,\bc_k,q_1,\dots,q_n) = \sum_{i=1}^n \|\bx_i - \bc_{q_i} \|_p^p$

is minimized. $K$-means is obtained for the case $p=2$ ( $l^2$ norm), because in this case the optimal centers are the means of the input vectors assigned to them. Here the generalization $p=1$ ( $l^1$ norm) will also be considered.

Up to normalization, the K-means objective $E$ is also the average reconstruction error if the original points are approximated with the cluster centers. Thus K-means is used not only to group the input points into cluster, but also to quantize their values.

K-means is widely used in computer vision, for example in the construction of vocabularies of visual features (visual words). In these applications the number $n$ of points to cluster and/or the number $K$ of clusters is often large. Unfortunately, minimizing the objective $E$ is in general a difficult combinatorial problem, so locally optimal or approximated solutions are sought instead.

The basic K-means algorithm alternate between re-estimating the centers and the assignments (Lloyd's algorithm). Combined with a good initialization strategy (Initialization methods) and, potentially, by re-running the optimization from a number of randomized starting states, this algorithm may attain satisfactory solutions in practice.

However, despite its simplicity, Lloyd's algorithm is often too slow. A good replacement is Elkan's algorithm (Elkan's algorithm), which uses the triangular inequality to cut down significantly the cost of Lloyd's algorithm. Since this algorithm is otherwise equivalent, it should often be preferred.

For very large problems (millions of point to clusters and hundreds, thousands, or more clusters to find), even Elkan's algorithm is not sufficiently fast. In these cases, one can resort to a variant of Lloyd's algorithm that uses an approximated nearest neighbors routine (ANN algorithm).

# Initialization methods

All the $K$-means algorithms considered here find locally optimal solutions; as such the way they are initialized is important. kmeans.h supports the following initialization algorithms:

Random data samples

The simplest initialization method is to sample $K$ points at random from the input data and use them as initial values for the cluster centers.

K-means++

[3] proposes a randomized initialization of the centers which improves upon random selection. The first center $\bc_1$ is selected at random from the data points $\bx_1, \dots, \bx_n$ and the distance from this center to all points $\|\bx_i - \bc_1\|_p^p$ is computed. Then the second center $\bc_2$ is selected at random from the data points with probability proportional to the distance. The procedure is repeated to obtain the other centers by using the minimum distance to the centers collected so far.

# Lloyd's algorithm

The most common K-means method is Lloyd's algorithm [16] . This algorithm is based on the observation that, while jointly optimizing clusters and assignment is difficult, optimizing one given the other is easy. Lloyd's algorithm alternates the steps:

1. Quantization. Each point $\bx_i$ is reassigned to the center $\bc_{q_j}$ closer to it. This requires finding for each point the closest among $K$ other points, which is potentially slow.
2. Center estimation. Each center $\bc_q$ is updated to minimize its average distances to the points assigned to it. It is easy to show that the best center is the mean or median of the points, respectively if the $l^2$ or $l^1$ norm is considered.

A naive implementation of the assignment step requires $O(dnK)$ operations, where $d$ is the dimensionality of the data, $n$ the number of data points, and $K$ the number of centers. Updating the centers is much cheaper: $O(dn)$ operations suffice to compute the $K$ means and a slightly higher cost is required for the medians. Clearly, the bottleneck is the assignment computation, and this is what the other K-means algorithm try to improve.

During the iterations, it can happen that a cluster becomes empty. In this case, K-means automatically “restarts” the cluster center by selecting a training point at random.

# Elkan's algorithm

Elkan's algorithm [7] is a variation of Lloyd alternate optimization algorithm (Lloyd's algorithm) that uses the triangular inequality to avoid many distance calculations when assigning points to clusters. While much faster than Lloyd, Elkan's method uses storage proportional to the umber of clusters by data points, which makes it unpractical for a very large number of clusters.

The idea of this algorithm is that, if a center update does not move them much, then most of the point-to-center computations can be avoided when the point-to-center assignments are recomputed. To detect which distances need evaluation, the triangular inequality is used to lower and upper bound distances after a center update.

Elkan algorithms uses two key observations. First, one has

$\|\bx_i - \bc_{q_i}\|_p \leq \|\bc - \bc_{q_i}\|_p / 2 \quad\Rightarrow\quad \|\bx_i - \bc_{q_i}\|_p \leq \|\bx_i - \bc\|_p.$

Thus if the distance between $\bx_i$ and its current center $\bc_{q_i}$ is less than half the distance of the center $\bc_{q_i}$ to another center $\bc$, then $\bc$ can be skipped when the new assignment for $\bx_i$ is searched. Checking this requires keeping track of all the inter-center distances, but centers are typically a small fraction of the training data, so overall this can be a significant saving. In particular, if this condition is satisfied for all the centers $\bc \not= \bc_{q_i}$, the point $\bx_i$ can be skipped completely. Furthermore, the condition can be tested also based on an upper bound $UB_i$ of $\|\bx_i - \bc_{q_i}\|_p$.

Second, if a center $\bc$ is updated to $\hat{\bc}$, then the new distance from $\bx$ to $\hat{\bc}$ is bounded from below and above by

$\|\bx - \bc\|_p - \|bc - \hat\bc\|_p \leq \|\bx - \hat{\bc}\|_p \leq \|\bx - \hat{\bc}\|_p + \|\bc + \hat{\bc}\|_p.$

This allows to maintain an upper bound on the distance of $\bx_i$ to its current center $\bc_{q_i}$ and a lower bound to any other center $\bc$:

\begin{align*} UB_i & \leftarrow UB_i + \|\bc_{q_i} - \hat{\bc}_{q_i} \|_p \\ LB_i(\bc) & \leftarrow LB_i(\bc) - \|\bc -\hat \bc\|_p. \end{align*}

Thus the K-means algorithm becomes:

1. Initialization. Compute $LB_i(\bc) = \|\bx_i -\hat \bc\|_p$ for all points and centers. Find the current assignments $q_i$ and bounds $UB_i$ by finding the closest centers to each point: $UB_i = \min_{\bc} LB_i(\bc)$.
2. Center estimation.
1. Recompute all the centers based on the new means; call the updated version $\hat{\bc}$.
2. Update all the bounds based on the distance $\|\bc - \hat\bc\|_p$ as explained above.
3. Set $\bc \leftarrow \hat\bc$ for all the centers and go to the next iteration.
3. Quantization.
1. Skip any point $\bx_i$ such that $UB_i \leq \frac{1}{2} \|\bc_{q_i} - \bc\|_p$ for all centers $\bc \not= \bc_{q_i}$.
2. For each remaining point $\bx_i$ and center $\bc \not= \bc_{q_i}$:
1. Skip $\bc$ if

$UB_i \leq \frac{1}{2} \| \bc_{q_i} - \bc \| \quad\text{or}\quad UB_i \leq LB_i(\bc).$

The first condition reflects the first observation above; the second uses the bounds to decide if $\bc$ can be closer than the current center $\bc_{q_i}$ to the point $\bx_i$. If the center cannot be skipped, continue as follows.
2. Skip $\bc$ if the condition above is satisfied after making the upper bound tight:

$UB_i = LB_i(\bc_{q_i}) = \| \bx_i - \bc_{q_i} \|_p.$

Note that the latter calculation can be done only once for $\bx_i$. If the center cannot be skipped still, continue as follows.
3. Tighten the lower bound too:

$LB_i(\bc) = \| \bx_i - \bc \|_p.$

At this point both $UB_i$ and $LB_i(\bc)$ are tight. If $LB_i < UB_i$, then the point $\bx_i$ should be reassigned to $\bc$. Update $q_i$ to the index of center $\bc$ and reset $UB_i = LB_i(\bc)$.

# ANN algorithm

The Approximate Nearest Neighbor (ANN) K-means algorithm [4] [27] [21] is a variant of Lloyd's algorithm (Lloyd's algorithm) uses a best-bin-first randomized KD-tree algorithm to approximately (and quickly) find the closest cluster center to each point. The KD-tree implementation is based on KD-trees and forests.

The algorithm can be summarized as follows:

1. Quantization. Each point $\bx_i$ is reassigned to the center $\bc_{q_j}$ closer to it. This starts by indexing the $K$ centers by a KD-tree and then using the latter to quickly find the closest center for every training point. The search is approximated to further improve speed. This opens up the possibility that a data point may receive an assignment that is worse than the current one. This is avoided by checking that the new assignment estimated by using ANN is an improvement; otherwise the old assignment is kept.
2. Center estimation. Each center $\bc_q$ is updated to minimize its average distances to the points assigned to it. It is easy to show that the best center is the mean or median of the points, respectively if the $l^2$ or $l^1$ norm is considered.

The key is to trade-off carefully the speedup obtained by using the ANN algorithm and the loss in accuracy when retrieving neighbors. Due to the curse of dimensionality, KD-trees become less effective for higher dimensional data, so that the search cost, which in the best case is logarithmic with this data structure, may become effectively linear. This is somehow mitigated by the fact that new a new KD-tree is computed at each iteration, reducing the likelihood that points may get stuck with sub-optimal assignments.

Experiments with the quantization of 128-dimensional SIFT features show that the ANN algorithm may use one quarter of the comparisons of Elkan's while retaining a similar solution accuracy.