Tutorials - Retrieval

Image retrieval benchmark tutorial

The source code of this tutorial is available in the retrievalDemo.m function; this text shows only selected parts of the code.

The current implementation does not support Microsoft Windows platforms.

Retrieval benchmark

The image retrieval benchmark tests feature extractors in a simple image retrieval system. The benchmark closely follows the work of Jegou et al. [1]. First, a set of local features is detected by the selected feature extractor and described using the selected descriptor. To find the most similar features, a K-nearest neighbours search is performed over the descriptors from all dataset images. Finally, a simple voting criterion based on the distances of the K nearest descriptors is used to sort the images (for details, cf. [1]).
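To give a rough idea of this scheme, the sketch below accumulates, for each query descriptor, distance-weighted votes from its K nearest dataset descriptors. This is only an illustration: allDescriptors, queryDescriptors, descImageIds and numImages are hypothetical placeholders, knnsearch (Statistics Toolbox) stands in for the benchmark's own KNN search, and the exponential weighting is merely an example of a distance-based criterion, not the exact one from [1].

% Illustrative sketch only -- not the benchmark implementation.
% Descriptors are stored as columns; descImageIds(j) tells which image the
% j-th dataset descriptor belongs to (all inputs here are hypothetical).
K = 50;
[nnIds, nnDists] = knnsearch(allDescriptors', queryDescriptors', 'K', K);
imgOfNN = descImageIds(nnIds);              % image each neighbour comes from
votes   = exp(-nnDists);                    % example distance-based weighting
scores  = accumarray(imgOfNN(:), votes(:), [numImages 1]);
[~, rankedList] = sort(scores, 'descend');  % best matching images first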

The dataset used in the evaluation consists of a set of images and a set of queries. The set of ground truth images for each query is split into three classes: 'Good', 'Ok' and 'Junk'. For each query, the average precision (the area under the precision-recall curve) is calculated, and the values are averaged over all queries to get the mean Average Precision (mAP) of the feature extractor.
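As a minimal sketch of the measure (with hypothetical precision-recall values), the AP of a single query is the area under its precision-recall curve, and mAP is the mean of the per-query APs:

% Hypothetical precision-recall values of one query
recall    = [0 0.25 0.50 0.75 1.00];
precision = [1 1.00 0.80 0.60 0.50];
ap = trapz(recall, precision)   % AP of this query = area under its PR curve
% mAP is then simply the mean of the APs of all queries:
% mAP = mean(apOfAllQueries);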

Feature extraction algorithms comparison

The main purpose of this benchmark is to compare feature extraction algorithms. In this tutorial we have selected several feature extractors which are part of the VLFeat library:

featExtractors{1} = VlFeatCovdet('method', 'hessianlaplace', ...
                                 'estimateaffineshape', true, ...
                                 'estimateorientation', false, ...
                                 'peakthreshold',0.0035,...
                                 'doubleImage', false);
featExtractors{2} = VlFeatCovdet('method', 'harrislaplace', ...
                                 'estimateaffineshape', true, ...
                                 'estimateorientation', false, ...
                                 'peakthreshold',0.0000004,...
                                 'doubleImage', false);
featExtractors{3} = VlFeatSift('PeakThresh',4);

The first two feature extractors are affine covariant, whereas the third one is only similarity invariant and closely corresponds to Lowe's original SIFT feature extractor (in fact, a DoG detector). All local features are described using the SIFT descriptor.

To run the image retrieval benchmark in a reasonable time, we defined a subset of the original 'The Oxford Buildings' dataset.

dataset = VggRetrievalDataset('Category','oxbuild',...
                              'OkImagesNum',inf,...
                              'JunkImagesNum',100,...
                              'BadImagesNum',100);

The subset of the Oxford Buildings dataset contains only 748 images, as only a part of the 'Junk' and 'Bad' images is included. 'Bad' images are images which do not contain anything from the queries. The original dataset consists of 5062 images.

Now an instance of a benchmark class is created. The benchmark computes the nearest neighbours over all images. This can consume too much memory; however, the search can be split into several parts and the results merged. The parameter 'MaxNumImagesPerSearch' sets how many images are processed in one KNN search.

For the call:
  retBenchmark = RetrievalBenchmark('MaxNumImagesPerSearch',100);
The estimated memory consumption is approximately:
 100 * 2000 * 128 * 4 ~ 100MB of memory 

This estimate assumes that each image contains around 2000 features on average, that each feature is described by a 128-dimensional descriptor, and that the KNN algorithm used works only with single precision (4 byte) values.
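The same estimate can be written out in MATLAB (the per-image feature count is only an assumed average):

numImages   = 100;   % 'MaxNumImagesPerSearch'
featPerImg  = 2000;  % assumed average number of features per image
descDim     = 128;   % SIFT descriptor dimensionality
bytesPerVal = 4;     % single precision values used by the KNN search
memMB = numImages * featPerImg * descDim * bytesPerVal / 2^20
% memMB is roughly 98, i.e. about 100 MB per KNN search part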

Finally we can run the benchmark:

for d=1:numel(featExtractors)
  [mAP(d) info(d)] =...
    retBenchmark.testFeatureExtractor(featExtractors{d}, dataset);
end

Even with a subset of the Oxford Buildings dataset, it takes a while to evaluate the benchmark for the selected feature extractors. The feature extraction for a single image takes several seconds, so overall the feature extraction takes approximately:

 3 extractors × 748 images × 3 s/image = 6732 s ~ 2 h 

This gives you plenty of time for a coffee or even a lunch. Fortunately, if you have set up the Matlab Parallel Computing Toolbox, running this benchmark with an open matlabpool runs the feature extraction and KNN computation in parallel.
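A minimal sketch, assuming the Parallel Computing Toolbox is installed (in recent MATLAB releases matlabpool has been superseded by parpool):

matlabpool('open', 4);   % or, in newer releases: parpool(4);
% ... run the benchmark loop above; the feature extraction and the KNN
% searches are then distributed over the workers ...
matlabpool('close');     % or: delete(gcp('nocreate'));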

Both the features and partial KNN search results are stored in the cache so the computation can be interrupted and resumed at any time.

Average precisions

The results of the benchmark can be viewed at several levels of detail. The most general result is the mean Average Precision (mAP), a single value per feature extractor. The mAPs can be visualised in a bar graph:

detNames = {'VLF-heslap', 'VLF-harlap', 'VLFeat-SIFT'};
bar(mAP);
Mean average precision of the selected feature extractors.
    VLF-heslap  VLF-harlap    VLF-SIFT
mAP      0.721       0.772       0.758
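The detNames cell array defined above can be used to annotate the bar graph, for example:

set(gca, 'XTickLabel', detNames);  % label the bars with the extractor names
ylabel('Mean average precision');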

Note that these values are computed only over a small part of the Oxford Buildings dataset and are therefore not directly comparable to state-of-the-art image retrieval systems run on the full Oxford Buildings dataset. On the other hand, this benchmark does not include any quantisation or spatial verification and is thus more focused on the comparison of image feature extractors.

An important measure in the comparison of feature extractors is the number of descriptors computed for each image, or the average number of features per image. The average numbers of features can be easily obtained using:

numDescriptors = cat(1,info(:).numDescriptors);
numQueryDescriptors = cat(1,info(:).numQueryDescriptors);
avgDescsNum(1,:) = mean(numDescriptors,2);
avgDescsNum(2,:) = mean(numQueryDescriptors,2);
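For a quick overview these averages can also be printed next to the extractor names, e.g. with simple fprintf formatting:

fprintf('%20s %12s %12s %12s\n', '', detNames{:});
fprintf('%20s %12.2f %12.2f %12.2f\n', 'Avg. #Descs.', avgDescsNum(1,:));
fprintf('%20s %12.2f %12.2f %12.2f\n', 'Avg. #Query Descs.', avgDescsNum(2,:));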

It can be seen that the selected feature extractors produce a similar number of features with the selected settings:

                    VLF-heslap  VLF-harlap    VLF-SIFT
      Avg. #Descs.    1803.822    1678.195    1843.202
Avg. #Query Descs.     892.582     869.255     853.582

To get a better insight into where the extractors differ, we can plot the APs for each query. These values are also contained in the info structure. For example, the APs for the first 15 queries can be visualised by:

queriesAp = cat(1,info(:).queriesAp); % Values from struct to single array
selectedQAps = queriesAp(:,1:15);     % Pick only first 15 queries
bar(selectedQAps');
Average precisions of the feature extractors over the first 15 queries.

As you can see, there are big differences between the queries. For example, query number 8 is a query where the SIFT feature extractor performs much worse than the other algorithms. Let's investigate query number 8 in more detail. As a first step, we can show the precision-recall curves.
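Such a query can also be picked programmatically, for instance as the query with the largest spread of APs across the extractors (a small sketch):

apSpread = max(queriesAp, [], 1) - min(queriesAp, [], 1);  % per-query AP range
[~, mostDiscriminativeQuery] = max(apSpread);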

Precision recall curves

The precision-recall curves are not part of the results but they can be easily calculated using the RetrievalBenchmark.rankedListAp(query, rankedList) static method.

queryNum = 8;
query = dataset.getQuery(queryNum);

% Ranked lists of the retrieved images, one per feature extractor
rankedLists = {info(:).rankedList};
for d=1:numel(featExtractors)
  % AP is calculated only based on the ranked list of retrieved images
  rankedList = rankedLists{d}(:,queryNum);
  [ap recall(:,d) precision(:,d)] = ...
    retBenchmark.rankedListAp(query, rankedList);
end

% Plot the curves
plot(recall, precision,'LineWidth',2); 
Precision-recall curves of the tested feature extractors for the 8th query.
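Axis labels and a legend make the plot easier to read, for example:

xlabel('Recall'); ylabel('Precision');
legend(detNames, 'Location', 'SouthWest');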

In this graph it can be seen that VLF-SIFT (DoG detector + SIFT descriptor) achieved a lower AP score because one of the first-ranked images is wrong; therefore the area under the precision-recall curve shrinks significantly.

We can verify this by showing the retrieved images.

Retrieved images

The indices of the retrieved images are stored in the info.rankedList field. It contains the indices of all dataset images sorted by the voting score of each image. The details of the voting scheme can be found in the help string of the retrieval benchmark class or in [1].
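For example, the identifier and path of the best-ranked image (after the query image itself, which always comes first) for the d-th extractor and the selected query can be read out as:

topImgId   = info(d).rankedList(2, queryNum);  % rank 1 is the query image
topImgPath = dataset.getImagePath(topImgId);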

To assess the performance of the feature extractors, let's inspect query number 8.

image(imread(dataset.getImagePath(query.imageId)));
% Convert query rectangle [xmin ymin xmax ymax] to [x y w h]
box = [query.box(1:2);query.box(3:4) - query.box(1:2)];
rectangle('Position',box,'LineWidth',2,'EdgeColor','y');
Query image of the 8th query with the query bounding box.

Having the ranked lists computed above, we can show the retrieved images for each feature extractor.

numViewedImages = 20;
% Pick the ranked list of the d-th feature extractor for the selected query
rankedList = rankedLists{d}(:,queryNum);
for ri = 1:numViewedImages
  % The first image is the query image itself
  imgId = rankedList(ri+1);
  imgPath = dataset.getImagePath(imgId);
  subplot(5,numViewedImages/5,ri);
  subimage(imread(imgPath));
end

Images retrieved by the VLFeat Hessian Laplace:

First 20 retrieved images by VLFeat Hessian Laplace feature extractor.

Images retrieved by the VLFeat Harris Laplace:

First 20 retrieved images by VLFeat Harris Laplace feature extractor.

And finally, images retrieved by the VLFeat SIFT:

First 20 retrieved images by VLFeat SIFT feature extractor.

Observe that indeed the second-ranked image is wrong for the VLF-SIFT feature extractor, consequently reducing its AP.

References

  1. H. Jegou, M. Douze, and C. Schmid. Exploiting descriptor distances for precise image search. Technical Report 7656, INRIA, 2011.