K-means clustering

kmeans.h implements a number of algorithm for K-means quantization: Lloyd [16] , an accelerated version by Elkan [7] , and a large scale algorithm based on Approximate Nearest Neighbors (ANN). All algorithms support float or double data and can use the $l^1$ or the $l^2$ distance for clustering. Furthermore, all algorithms can take advantage of multiple CPU cores.

Please see K-means fundamentals for a technical description of K-means and of the algorithms implemented here.

# Getting started

The goal of K-means is to partition a dataset into $K$ “compact” clusters. The following example demonstrates using kmeans.h in the C programming language to partition numData float vectors into compute numCenters clusters using Lloyd's algorithm:

#include <vl/kmeans.h>
double energy ;
double * centers ;
// Use float data and the L2 distance for clustering
KMeans * kmeans = vl_kmeans_new (VLDistanceL2, VL_TYPE_FLOAT) ;
// Use Lloyd algorithm
// Initialize the cluster centers by randomly sampling the data
vl_kmeans_init_centers_with_rand_data (kmeans, data, dimension, numData, numCenters) ;
// Run at most 100 iterations of cluster refinement using Lloyd algorithm
vl_kmeans_refine_centers (kmeans, data, numData) ;
// Obtain the energy of the solution
energy = vl_kmeans_get_energy(kmeans) ;
// Obtain the cluster centers
centers = vl_kmeans_get_centers(kmeans) ;

Once the centers have been obtained, new data points can be assigned to clusters by using the vl_kmeans_quantize function:

vl_uint32 * assignments = vl_malloc(sizeof(vl_uint32) * numData) ;
float * distances = vl_malloc(sizeof(float) * numData) ;
vl_kmeans_quantize(kmeans, assignments, distances, data, numData) ;

Alternatively, one can directly assign new pointers to the closest centers, without bothering with a VlKMeans object.

There are several considerations that may impact the performance of KMeans. First, since K-means is usually based local optimization algorithm, the initialization method is important. The following initialization methods are supported:

Method Function Description
Random samples vl_kmeans_init_centers_with_rand_data Random data points
K-means++ vl_kmeans_init_centers_plus_plus Random selection biased towards diversity
Custom vl_kmeans_set_centers Choose centers (useful to run quantization only)

See Initialization methods for further details. The initialization methods use a randomized selection of the data points; the random number generator init is controlled by vl_rand_init.

The second important choice is the optimization algorithm. The following optimization algorithms are supported:

Algorithm Symbol See Description
Lloyd VlKMeansLloyd Lloyd's algorithm Alternate EM-style optimization
Elkan VlKMeansElkan Elkan's algorithm A speedup using triangular inequalities
ANN VlKMeansANN ANN algorithm A speedup using approximated nearest neighbors

See the relative sections for further details. These algorithm are iterative, and stop when either a maximum number of iterations (vl_kmeans_set_max_num_iterations) is reached, or when the energy changes sufficiently slowly in one iteration (vl_kmeans_set_min_energy_variation).

All the three algorithms support multithreaded computations. The number of threads used is usually controlled globally by vl_set_num_threads.