The FV of [23] is a special case of the Fisher kernel construction. It is designed to encode local image features in a format that is suitable for learning and for comparison with simple metrics such as the Euclidean distance. In this construction, an image is modeled as a collection of \(D\)-dimensional feature vectors \(I=(\bx_1,\dots,\bx_n)\) generated by a GMM with \(K\) components \(\Theta=(\mu_k,\Sigma_k,\pi_k:k=1,\dots,K)\). The covariance matrices are assumed to be diagonal, i.e. \(\Sigma_k = \diag \bsigma_k^2\), \(\bsigma_k \in \real^D_+\).

The generative model of *one* feature vector \(\bx\) is given by the GMM density function:

\[ p(\bx|\Theta) = \sum_{k=1}^K \pi_k p(\bx|\Theta_k), \quad p(\bx|\Theta_k) = \frac{1}{(2\pi)^\frac{D}{2} (\det \Sigma_k)^{\frac{1}{2}}} \exp \left[ -\frac{1}{2} (\bx - \mu_k)^\top \Sigma_k^{-1} (\bx - \mu_k) \right] \]
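As a concrete illustration, for diagonal covariances the determinant and the quadratic form in the density above factorize over dimensions, so the density can be evaluated directly. The following is a minimal NumPy sketch (not a reference implementation; all names are illustrative):

```python
import numpy as np

def gmm_density(x, pi, mu, sigma2):
    """GMM density p(x | Theta) with diagonal covariances.

    x: (D,) feature vector; pi: (K,) priors;
    mu: (K, D) means; sigma2: (K, D) diagonal variances.
    """
    D = x.shape[0]
    # Quadratic form separates over dimensions for diagonal Sigma_k.
    quad = np.sum((x - mu) ** 2 / sigma2, axis=1)                    # (K,)
    # log of the normalization constant: det(Sigma_k) = prod_j sigma_{jk}^2.
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(sigma2), axis=1))
    component = np.exp(log_norm - 0.5 * quad)                        # p(x | Theta_k)
    return np.sum(pi * component)
```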

where \(\Theta_k = (\mu_k,\Sigma_k)\). The Fisher vector requires computing the derivative of the log-likelihood function with respect to the various model parameters. Consider in particular the parameters \(\Theta_k\) of a single mode. Due to the exponential form of the Gaussian density, the derivative can be written as

\[ \nabla_{\Theta_k} p(\bx|\Theta_k) = p(\bx|\Theta_k) g(\bx|\Theta_k) \]

for a simple vector function \(g\). The derivative of the log-likelihood function is then

\[ \nabla_{\Theta_k} \log p(\bx|\Theta) = \frac{\pi_k p(\bx|\Theta_k)}{\sum_{t=1}^K \pi_t p(\bx|\Theta_t)} g(\bx|\Theta_k) = q_k(\bx) g(\bx|\Theta_k) \]

where \(q_k(\bx)\) is the soft assignment of the point \(\bx\) to the mode \(k\). We make the approximation that \(q_k(\bx)\approx 1\) if \(\bx\) is sampled from mode \(k\) and \(q_k(\bx) \approx 0\) otherwise [23]. Hence one gets:
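The soft assignments \(q_k(\bx)\) can be computed jointly for all modes. A sketch, working in the log domain for numerical stability (names are illustrative):

```python
import numpy as np

def soft_assignments(x, pi, mu, sigma2):
    """Posterior q_k(x) = pi_k p(x|Theta_k) / sum_t pi_t p(x|Theta_t).

    x: (D,); pi: (K,); mu: (K, D); sigma2: (K, D). Returns q: (K,).
    """
    # Log of pi_k p(x | Theta_k) for every mode at once.
    log_comp = (np.log(pi)
                - 0.5 * np.sum(np.log(2 * np.pi * sigma2), axis=1)
                - 0.5 * np.sum((x - mu) ** 2 / sigma2, axis=1))
    log_comp -= log_comp.max()   # stabilize before exponentiating
    q = np.exp(log_comp)
    return q / q.sum()
```

Note that for a point drawn well inside one mode the posterior saturates near 1, which is exactly the hard-assignment approximation used above.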

\[ E_{\bx \sim p(\bx|\Theta)} [ \nabla_{\Theta_k} \log p(\bx|\Theta) \nabla_{\Theta_t} \log p(\bx|\Theta)^\top ] \approx \begin{cases} \pi_k E_{\bx \sim p(\bx|\Theta_k)} [ g(\bx|\Theta_k) g(\bx|\Theta_k)^\top], & t = k, \\ 0, & t\not=k. \end{cases} \]

Thus under this approximation there is no correlation between the parameters of the various Gaussian modes.

The function \(g\) can be further broken down into the stacking of the derivatives w.r.t. the mean \(\mu_k\) and the diagonal covariance \(\bsigma_k^2\):

\[ g(\bx|\Theta_k) = \begin{bmatrix} g(\bx|\mu_k) \\ g(\bx|\bsigma_k^2) \end{bmatrix}, \quad [g(\bx|\mu_k)]_j = \frac{x_j - \mu_{jk}}{\sigma_{jk}^2}, \quad [g(\bx|\bsigma_k^2)]_j = \frac{1}{2\sigma_{jk}^2} \left( \left(\frac{x_j - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1 \right) \]
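These two expressions can be sanity-checked numerically: each entry of \(g\) should equal the corresponding partial derivative of \(\log p(\bx|\Theta_k)\). A small finite-difference check for one dimension (illustrative values, not from the text):

```python
import numpy as np

def log_gauss(x, mu, sigma2):
    """Log-density of a 1-D Gaussian with mean mu and variance sigma2."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

x, mu, sigma2, eps = 1.3, 0.2, 0.7, 1e-6

# Closed-form gradients from the equations above.
g_mu = (x - mu) / sigma2
g_sig = 0.5 / sigma2 * ((x - mu) ** 2 / sigma2 - 1)

# Central finite differences w.r.t. mu and sigma^2.
num_mu = (log_gauss(x, mu + eps, sigma2) - log_gauss(x, mu - eps, sigma2)) / (2 * eps)
num_sig = (log_gauss(x, mu, sigma2 + eps) - log_gauss(x, mu, sigma2 - eps)) / (2 * eps)

assert abs(g_mu - num_mu) < 1e-5 and abs(g_sig - num_sig) < 1e-5
```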

Thus, under this approximation, the covariance of the model (the Fisher information matrix) is diagonal, with diagonal entries given by

\[ H_{\mu_{jk}} = \pi_k E[g(\bx|\mu_{jk})g(\bx|\mu_{jk})] = \frac{\pi_k}{\sigma_{jk}^2}, \quad H_{\sigma_{jk}^2} = \frac{\pi_k}{2 \sigma_{jk}^4}. \]

where the calculation uses the fact that the fourth moment of the standard Gaussian distribution is 3: for \(z \sim \mathcal{N}(0,1)\), \(E[(z^2-1)^2] = E[z^4] - 2E[z^2] + 1 = 2\). Multiplying the derivative of the log-likelihood function by the inverse square root of the matrix \(H\) results in the Fisher vector encoding of one image feature \(\bx\):

\[ \Phi_{\mu_{jk}}(\bx) = H_{\mu_{jk}}^{-\frac{1}{2}} q_k(\bx) g(\bx|\mu_{jk}) = q_k(\bx) \frac{x_j - \mu_{jk}}{\sqrt{\pi_k}\sigma_{jk}}, \qquad \Phi_{\sigma^2_{jk}}(\bx) = \frac{q_k(\bx)}{\sqrt{2 \pi_k}} \left( \left(\frac{x_j - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1 \right) \]
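The per-feature encoding can be sketched directly from these two formulas; for each mode the mean and variance components are stacked into a \(2KD\)-dimensional vector (a minimal sketch, assuming the soft assignments \(q\) are precomputed; names are illustrative):

```python
import numpy as np

def fv_encode_one(x, q, pi, mu, sigma2):
    """Fisher vector encoding Phi(x) of a single feature.

    x: (D,); q: (K,) soft assignments of x; pi: (K,) priors;
    mu: (K, D); sigma2: (K, D). Returns a (2*K*D,) vector.
    """
    sigma = np.sqrt(sigma2)
    u = (x - mu) / sigma                                         # standardized residual, (K, D)
    # Mean part:     q_k (x_j - mu_jk) / (sqrt(pi_k) sigma_jk)
    phi_mu = (q[:, None] / np.sqrt(pi)[:, None]) * u
    # Variance part: q_k / sqrt(2 pi_k) * (u^2 - 1)
    phi_sig = (q[:, None] / np.sqrt(2 * pi)[:, None]) * (u ** 2 - 1)
    return np.concatenate([phi_mu.ravel(), phi_sig.ravel()])
```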

Assuming that the features are sampled i.i.d. from the GMM results in the formulas given in Fisher vector fundamentals (note the normalization factor). Note that:

- The Fisher components relative to the prior probabilities \(\pi_k\) have been ignored, as they have little effect on the representation [24].

- Technically, the derivation of the Fisher vector for multiple image features requires the number of features to be the same in the two images being compared. In practice, however, the representation can be computed from any number of features.
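Putting the pieces together, the encoding of a whole image averages the per-feature encodings over the \(n\) available features (the \(1/n\) factor is the normalization noted above). A self-contained NumPy sketch under the same illustrative naming as before:

```python
import numpy as np

def fisher_vector(X, pi, mu, sigma2):
    """Fisher vector of an image from its features.

    X: (n, D) feature matrix; pi: (K,); mu: (K, D); sigma2: (K, D).
    Returns the (2*K*D,) image encoding.
    """
    sigma = np.sqrt(sigma2)
    # Soft assignments q_k(x_i) for all features at once, in log space: (n, K)
    log_q = (np.log(pi)
             - 0.5 * np.sum(np.log(2 * np.pi * sigma2), axis=1)
             - 0.5 * ((X[:, None, :] - mu) ** 2 / sigma2).sum(axis=2))
    log_q -= log_q.max(axis=1, keepdims=True)
    q = np.exp(log_q)
    q /= q.sum(axis=1, keepdims=True)
    # Standardized residuals u_ikj = (x_ij - mu_jk) / sigma_jk: (n, K, D)
    u = (X[:, None, :] - mu) / sigma
    phi_mu = (q[..., None] * u) / np.sqrt(pi)[None, :, None]
    phi_sig = (q[..., None] * (u ** 2 - 1)) / np.sqrt(2 * pi)[None, :, None]
    n = X.shape[0]
    # Average over the n features (i.i.d. normalization).
    return np.concatenate([phi_mu.sum(0).ravel(), phi_sig.sum(0).ravel()]) / n
```

Because of the averaging, the dimensionality of the output is fixed at \(2KD\) regardless of how many features the image contains, which is what makes encodings of different images directly comparable.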