KD-trees and forests

kdtree.h implements a KD-tree object, a data structure that can efficiently index moderate-dimensional vector spaces. Both the best-bin-first KD-tree [4] and randomized KD-tree forests [27], [21] are implemented. Applications include fast matching of feature descriptors.

# Overview

To create a VlKDForest object use vl_kdforest_new, specifying the dimensionality of the data and the number of trees in the forest. With a single tree, the algorithm is analogous to the best-bin-first KD-tree of [4]; multiple trees correspond to the randomized KD-tree forest of [27], [21].

To let the KD-tree index some data use vl_kdforest_build. Note that, for efficiency, the KD-tree does not copy the data but retains a pointer to it. Therefore the data must exist, and must not change, until the KD-tree is deleted. To delete the KD-tree object, use vl_kdforest_delete.

To find the N nearest neighbors of a query point, first instantiate a VlKDForestSearcher and then start the search using vl_kdforest_query with the searcher object as an argument. To set a maximum number of comparisons per query, and hence compute approximate nearest neighbors, use vl_kdforest_set_max_num_comparisons.
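The workflow above can be sketched as follows. This is an illustrative fragment assuming the VLFeat C API declared in `<vl/kdtree.h>`; the exact signatures (in particular the arguments of vl_kdforest_new and the searcher query call) may differ between VLFeat releases, so check your version's header.

```c
#include <vl/kdtree.h>

/* Sketch: build a one-tree forest over float data and query it.
 * The data array must outlive the forest (no copy is made). */
void example (float const *data, vl_size numData,
              float const *query, vl_size dimension)
{
  VlKDForest *forest =
    vl_kdforest_new (VL_TYPE_FLOAT, dimension, 1, VlDistanceL2) ;
  vl_kdforest_build (forest, numData, data) ;
  vl_kdforest_set_max_num_comparisons (forest, 100) ; /* approximate search */

  VlKDForestSearcher *searcher = vl_kdforest_new_searcher (forest) ;
  VlKDForestNeighbor neighbors [5] ;
  vl_kdforestsearcher_query (searcher, neighbors, 5, query) ;
  /* neighbors[k].index and neighbors[k].distance hold the k-th match */

  vl_kdforest_delete (forest) ; /* also deletes searchers bound to it */
}
```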

# Technical details

VlKDForest implements the best-bin-first KD-tree of [4].

Construction. Given a set of points $x_1,\dots,x_n \in \mathbb{R}^d$, the algorithm recursively partitions the $d$-dimensional Euclidean space $\mathbb{R}^d$ into (hyper-) rectangles.

Partitions are organized into a binary tree with the root corresponding to the whole space $\mathbb{R}^d$. The algorithm refines each partition by dividing it into two halves by thresholding along a given dimension. Both the splitting dimension and the threshold are determined as a statistic of the data points contained in the partition: the splitting dimension is the one with the largest sample variance, and the splitting threshold is either the sample mean or the sample median. Leaves are atomic partitions and contain a list of zero or more data points (typically one).
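As an illustration, here is a self-contained C sketch of this splitting rule using the mean threshold (the names are illustrative, not VLFeat's internals, and VLFeat also supports a median threshold):

```c
#include <stddef.h>

/* Pick the splitting dimension and threshold for n points in d
 * dimensions, stored row-major: the dimension with the largest sample
 * variance wins, and the threshold is the sample mean along it. */
size_t choose_split_dimension(const double *points, size_t n, size_t d,
                              double *threshold_out)
{
    size_t best_dim = 0;
    double best_var = -1.0;
    for (size_t dim = 0; dim < d; ++dim) {
        double mean = 0.0, var = 0.0;
        for (size_t i = 0; i < n; ++i) mean += points[i * d + dim];
        mean /= (double)n;
        for (size_t i = 0; i < n; ++i) {
            double delta = points[i * d + dim] - mean;
            var += delta * delta;
        }
        if (var > best_var) {
            best_var = var;
            best_dim = dim;
            *threshold_out = mean;
        }
    }
    return best_dim;
}
```

The recursion then sends points below the threshold to one child and the rest to the other, stopping when a partition holds few enough points to become a leaf.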

Querying. Querying amounts to finding the N data points closest to a given query point $x_q \in \mathbb{R}^d$. This is done by branch-and-bound. A search state is an active partition (initially the root), weighted by a lower bound on the distance between any point in the partition and the query point. Such a lower bound is trivial to compute because partitions are hyper-rectangles.
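Concretely, for an axis-aligned box only the coordinates of the query that fall outside the box contribute to the bound. A self-contained C sketch (illustrative names, not VLFeat's internals):

```c
#include <stddef.h>

/* Squared-distance lower bound from a query point to the hyper-rectangle
 * [lo, hi] in d dimensions. For each coordinate, the contribution is the
 * amount by which the query lies outside the box, squared; a query inside
 * the box yields a bound of zero. */
double box_lower_bound(const double *query, const double *lo,
                       const double *hi, size_t d)
{
    double bound = 0.0;
    for (size_t i = 0; i < d; ++i) {
        double excess = 0.0;
        if (query[i] < lo[i])      excess = lo[i] - query[i];
        else if (query[i] > hi[i]) excess = query[i] - hi[i];
        bound += excess * excess;
    }
    return bound;
}
```

Branch-and-bound visits partitions in order of increasing bound and can stop (or, in the approximate setting, be cut off after a fixed number of comparisons) once the bound exceeds the distance to the N-th best point found so far.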

Querying usage. As mentioned above, a user must create a VlKDForestSearcher instance using vl_kdforest_new_searcher in order to run queries. When a KD-forest is deleted, all searchers bound to it are deleted automatically. To delete an individual searcher before the KD-forest itself, use the vl_kdforest_delete_searcher method.