Fisher kernel

This page discusses the Fisher Kernel (FK) of [10] and shows how the Fisher Vector (FV) of [23] can be derived from it as a special case. The FK induces a similarity measure between data points $\bx$ and $\bx'$ from a parametric generative model $p(\bx|\Theta)$ of the data. The parameter $\Theta$ of the model is selected to fit the a-priori distribution of the data, and is usually the Maximum Likelihood Estimate (MLE) obtained from a set of training examples. Once the generative model is learned, each datum $\bx$ is represented by how it affects the maximum likelihood parameter estimate. This effect is measured by computing the gradient of the log-likelihood term corresponding to $\bx$:

$\hat\Phi(\bx) = \nabla_\Theta \log p(\bx|\Theta)$
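As a concrete illustration, assume a univariate Gaussian model with $\Theta = (\mu, \sigma)$. This model is chosen here only as an example and is not taken from [10] or [23]; the function names are likewise illustrative. The following Python sketch evaluates the gradient in closed form and compares it against central finite differences of the log-likelihood.

```python
import math

def log_likelihood(x, mu, sigma):
    """Log-density of a univariate Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z ** 2

def score(x, mu, sigma):
    """Closed-form gradient of log p(x | mu, sigma) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3
    return [d_mu, d_sigma]

def numerical_score(x, mu, sigma, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    d_mu = (log_likelihood(x, mu + eps, sigma)
            - log_likelihood(x, mu - eps, sigma)) / (2.0 * eps)
    d_sigma = (log_likelihood(x, mu, sigma + eps)
               - log_likelihood(x, mu, sigma - eps)) / (2.0 * eps)
    return [d_mu, d_sigma]
```

Calling `score(1.5, 0.5, 2.0)` and `numerical_score(1.5, 0.5, 2.0)` should return nearly identical two-dimensional vectors, one component per model parameter.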

The vectors $\hat\Phi(\bx)$ should be appropriately scaled before they can be meaningfully compared. This is done by whitening them, i.e. by multiplying each vector by the inverse square root of their covariance matrix. The covariance matrix can be obtained from the generative model $p(\bx|\Theta)$ itself. Since $\Theta$ is the maximum likelihood parameter and $\hat\Phi(\bx)$ is the gradient of the log-likelihood function, its expected value $E[\hat\Phi(\bx)]$ is zero. Thus, since the vectors are already centered, their covariance matrix is simply:

$H = E_{\bx \sim p(\bx|\Theta)} [\hat\Phi(\bx) \hat\Phi(\bx)^\top]$
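As a rough numerical check, assume a univariate Gaussian model with $\Theta = (\mu, \sigma)$ (an assumption made only for this example). For this model the matrix has the closed form $H = \mathrm{diag}(1/\sigma^2, 2/\sigma^2)$. The sketch below, with illustrative function names, draws samples from the model and estimates both the mean of the gradient, which should be close to zero, and its second moment, which should be close to the closed form.

```python
import random

def score(x, mu, sigma):
    """Gradient of the Gaussian log-likelihood with respect to (mu, sigma)."""
    return [(x - mu) / sigma ** 2, ((x - mu) ** 2 - sigma ** 2) / sigma ** 3]

def estimate_moments(mu, sigma, n_samples=100000, seed=0):
    """Monte Carlo estimates of E[score] and E[score score^T] under the model."""
    rng = random.Random(seed)
    mean = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_samples):
        g = score(rng.gauss(mu, sigma), mu, sigma)
        for i in range(2):
            mean[i] += g[i] / n_samples
            for j in range(2):
                H[i][j] += g[i] * g[j] / n_samples
    return mean, H

def fisher_information(sigma):
    """Closed-form H for N(mu, sigma^2) in the (mu, sigma) parametrization."""
    return [[1.0 / sigma ** 2, 0.0], [0.0, 2.0 / sigma ** 2]]
```

Comparing the second return value of `estimate_moments(0.5, 2.0)` with `fisher_information(2.0)` gives agreement up to sampling noise, which shrinks as `n_samples` grows.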

Note that $H$ is also the Fisher information matrix of the model. The final FV encoding $\Phi(\bx)$ is given by the whitened gradient of the log-likelihood function, i.e.:

$\Phi(\bx) = H^{-\frac{1}{2}} \nabla_\Theta \log p(\bx|\Theta).$

Taking the inner product of two such vectors yields the Fisher kernel:

$K(\bx,\bx') = \langle \Phi(\bx),\Phi(\bx') \rangle = \nabla_\Theta \log p(\bx|\Theta)^\top H^{-1} \nabla_\Theta \log p(\bx'|\Theta).$
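
The two expressions for the kernel can be compared on a small example. The sketch below assumes a univariate Gaussian model with $\Theta = (\mu, \sigma)$, for which $H = \mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ is diagonal, so that $H^{-\frac{1}{2}}$ and $H^{-1}$ are available in closed form. Both the model and the function names are assumptions made for illustration; a model with a non-diagonal $H$ would require a matrix square root instead.

```python
import math

def score(x, mu, sigma):
    """Gradient of the Gaussian log-likelihood with respect to (mu, sigma)."""
    return [(x - mu) / sigma ** 2, ((x - mu) ** 2 - sigma ** 2) / sigma ** 3]

def fisher_vector(x, mu, sigma):
    """Whitened gradient H^{-1/2} grad log p(x | mu, sigma).

    Here H = diag(1/sigma^2, 2/sigma^2), hence
    H^{-1/2} = diag(sigma, sigma / sqrt(2)).
    """
    g = score(x, mu, sigma)
    return [sigma * g[0], sigma / math.sqrt(2.0) * g[1]]

def fisher_kernel(x, y, mu, sigma):
    """Kernel as the inner product of the two whitened gradients."""
    a = fisher_vector(x, mu, sigma)
    b = fisher_vector(y, mu, sigma)
    return sum(ai * bi for ai, bi in zip(a, b))

def fisher_kernel_direct(x, y, mu, sigma):
    """Same kernel as g(x)^T H^{-1} g(y), with H^{-1} = diag(sigma^2, sigma^2 / 2)."""
    gx = score(x, mu, sigma)
    gy = score(y, mu, sigma)
    return sigma ** 2 * gx[0] * gy[0] + 0.5 * sigma ** 2 * gx[1] * gy[1]
```

For any pair of inputs, `fisher_kernel` and `fisher_kernel_direct` should agree up to floating-point rounding, mirroring the two sides of the identity above.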