This page describes the Vector of Locally Aggregated Descriptors (VLAD) image encoding. See Vector of Locally Aggregated Descriptors (VLAD) encoding for an overview of the C API.

VLAD is a feature encoding and pooling method, similar to Fisher vectors. VLAD encodes a set of local feature descriptors $I=(\bx_1,\dots,\bx_N)$ extracted from an image using a dictionary built with a clustering method such as Gaussian Mixture Models (GMM) or K-means clustering. Let $q_{ik}$ be the strength of the association of data vector $\bx_i$ to cluster $\mu_k$, such that $q_{ik} \geq 0$ and $\sum_{k=1}^K q_{ik} = 1$. The association may be either soft (e.g. obtained as the posterior probabilities of the GMM clusters) or hard (e.g. obtained by vector quantization with K-means).
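As an illustration of the two kinds of association, the following minimal sketch computes $q_{ik}$ in pure Python. The function names `hard_assignments` and `soft_assignments` are hypothetical and not part of any library API. The soft variant assumes an equal-weight GMM with a shared isotropic variance `sigma2`, which is a simplification chosen for brevity; a full GMM would also use per-cluster priors and covariances.

```python
import math

def sq_dist(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def hard_assignments(data, means):
    """q[i][k] = 1 for the nearest mean and 0 otherwise (vector quantization)."""
    q = []
    for x in data:
        dists = [sq_dist(x, mu) for mu in means]
        nearest = dists.index(min(dists))
        q.append([1.0 if k == nearest else 0.0 for k in range(len(means))])
    return q

def soft_assignments(data, means, sigma2=1.0):
    """Posteriors under an equal-weight isotropic GMM (assumption of this sketch)."""
    q = []
    for x in data:
        logits = [-sq_dist(x, mu) / (2.0 * sigma2) for mu in means]
        m = max(logits)  # subtract the maximum for numerical stability
        w = [math.exp(l - m) for l in logits]
        s = sum(w)
        q.append([wk / s for wk in w])  # each row sums to 1
    return q

# Toy data: three 2-D descriptors and two cluster means.
data = [[0.0, 0.0], [1.0, 0.2], [4.0, 4.0]]
means = [[0.0, 0.0], [4.0, 4.0]]
q_hard = hard_assignments(data, means)
q_soft = soft_assignments(data, means)
```

In both cases every row of `q` is non-negative and sums to one, matching the constraints on $q_{ik}$ stated above.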

Here $\mu_k$ are the cluster means, vectors of the same dimension as the data $\bx_i$. VLAD encodes the set of features $I$ by accumulating, for each cluster $k$, the residuals

$\bv_k = \sum_{i=1}^{N} q_{ik} (\bx_{i} - \mu_k).$

The residuals are stacked together to obtain the vector

$\hat\Phi(I) = \begin{bmatrix} \vdots \\ \bv_k \\ \vdots \end{bmatrix}$
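The accumulation and stacking can be sketched in a few lines of pure Python. This is an illustrative sketch, not the library's implementation; `vlad_encode` is a hypothetical name, and the assignments `q` are assumed to be given as an $N \times K$ list of lists.

```python
def vlad_encode(data, means, q):
    """Return the unnormalized VLAD vector as a flat list of length K * D.

    data:  N descriptors, each a list of D floats
    means: K cluster means, each a list of D floats
    q:     N x K association strengths q[i][k]
    """
    K, D = len(means), len(means[0])
    encoding = []
    for k in range(K):
        v_k = [0.0] * D
        for x, q_i in zip(data, q):
            for d in range(D):
                # Accumulate the weighted residual q_ik * (x_i - mu_k).
                v_k[d] += q_i[k] * (x[d] - means[k][d])
        encoding.extend(v_k)  # stack v_1, ..., v_K
    return encoding

data = [[1.0, 0.0], [0.0, 1.0], [5.0, 4.0]]
means = [[0.0, 0.0], [4.0, 4.0]]
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hard assignments
phi = vlad_encode(data, means, q)
# phi is [1.0, 1.0, 1.0, 0.0]
```

With hard assignments, each $\bv_k$ is simply the sum of the offsets from $\mu_k$ of the descriptors quantized to cluster $k$, so the encoding has dimension $K$ times the descriptor dimension.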

Before the VLAD encoding is used, it is usually normalized, as described next.

• **Component-wise mass normalization.** Each vector $\bv_k$ is divided by the total mass of features associated to it $\sum_{i=1}^N q_{ik}$.
• **Square-rooting.** The function $\sign(z)\sqrt{|z|}$ is applied to all scalar components of the VLAD descriptor.
• **Component-wise $l^2$ normalization.** The vectors $\bv_k$ are divided by their norm $\|\bv_k\|_2$.
• **Global $l^2$ normalization.** The VLAD descriptor $\hat\Phi(I)$ is divided by its norm $\|\hat\Phi(I)\|_2$.
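The four normalizations above can be sketched as optional steps of a single function. This is a hedged illustration: `normalize_vlad` and its keyword flags are hypothetical names, the steps are applied in the order they are listed (an assumption of this sketch), and the guards that skip division by zero for empty clusters are likewise a choice made here rather than documented behavior.

```python
import math

def normalize_vlad(v, masses, mass=False, square_root=False,
                   component_l2=False, global_l2=True):
    """Normalize a VLAD encoding and return it as a flat list.

    v:      list of K residual vectors v_k (each a list of floats)
    masses: list of K totals sum_i q_ik, used only if mass=True
    """
    v = [list(v_k) for v_k in v]  # work on a copy
    if mass:
        # Divide each v_k by the total mass assigned to cluster k.
        v = [[z / m for z in v_k] if m > 0 else v_k
             for v_k, m in zip(v, masses)]
    if square_root:
        # Apply sign(z) * sqrt(|z|) to every scalar component.
        v = [[math.copysign(math.sqrt(abs(z)), z) for z in v_k] for v_k in v]
    if component_l2:
        # Divide each v_k by its own l2 norm.
        out = []
        for v_k in v:
            n = math.sqrt(sum(z * z for z in v_k))
            out.append([z / n for z in v_k] if n > 0 else v_k)
        v = out
    flat = [z for v_k in v for z in v_k]
    if global_l2:
        # Divide the stacked descriptor by its l2 norm.
        n = math.sqrt(sum(z * z for z in flat))
        if n > 0:
            flat = [z / n for z in flat]
    return flat

v = [[3.0, -4.0], [0.0, 0.0]]   # two clusters, the second one empty
masses = [2.0, 0.0]
phi = normalize_vlad(v, masses)  # global l2 normalization only
```

After global $l^2$ normalization the descriptor has unit norm, so descriptors can be compared with a plain inner product.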