### Table of Contents

kmeans.h implements a number of algorithm for **K-means quantization**: Lloyd [16] , an accelerated version by Elkan [7] , and a large scale algorithm based on Approximate Nearest Neighbors (ANN). All algorithms support `float`

or `double`

data and can use the \(l^1\) or the \(l^2\) distance for clustering. Furthermore, all algorithms can take advantage of multiple CPU cores.

Please see K-means fundamentals for a technical description of K-means and of the algorithms implemented here.

# Getting started

The goal of K-means is to partition a dataset into \(K\) “compact” clusters. The following example demonstrates using kmeans.h in the C programming language to partition `numData`

`float`

vectors into compute `numCenters`

clusters using Lloyd's algorithm:

Once the centers have been obtained, new data points can be assigned to clusters by using the vl_kmeans_quantize function:

Alternatively, one can directly assign new pointers to the closest centers, without bothering with a VlKMeans object.

There are several considerations that may impact the performance of KMeans. First, since K-means is usually based local optimization algorithm, the **initialization method** is important. The following initialization methods are supported:

Method | Function | Description |
---|---|---|

Random samples | vl_kmeans_init_centers_with_rand_data | Random data points |

K-means++ | vl_kmeans_init_centers_plus_plus | Random selection biased towards diversity |

Custom | vl_kmeans_set_centers | Choose centers (useful to run quantization only) |

See Initialization methods for further details. The initialization methods use a randomized selection of the data points; the random number generator init is controlled by vl_rand_init.

The second important choice is the **optimization algorithm**. The following optimization algorithms are supported:

Algorithm | Symbol | See | Description |
---|---|---|---|

Lloyd | VlKMeansLloyd | Lloyd's algorithm | Alternate EM-style optimization |

Elkan | VlKMeansElkan | Elkan's algorithm | A speedup using triangular inequalities |

ANN | VlKMeansANN | ANN algorithm | A speedup using approximated nearest neighbors |

See the relative sections for further details. These algorithm are iterative, and stop when either a **maximum number of iterations** (vl_kmeans_set_max_num_iterations) is reached, or when the energy changes sufficiently slowly in one iteration (vl_kmeans_set_min_energy_variation).

All the three algorithms support multithreaded computations. The number of threads used is usually controlled globally by vl_set_num_threads.