The Scale-Invariant Feature Transform (SIFT) bundles a feature detector and a feature descriptor. The detector extracts from an image a number of frames (attributed regions) in a way that is consistent with (some) variations of the illumination, viewpoint and other viewing conditions. The descriptor associates with each region a signature that identifies its appearance compactly and robustly.

Extracting frames and descriptors

Both the detector and the descriptor are accessible through the sift MATLAB command (there is a similar command line utility). Open MATLAB and load a test image:

pfx = fullfile(vlfeat_root,'data','a.jpg') ;
I = imread(pfx) ;
image(I) ;
Input image.

The sift command requires an image in gray scale format and single precision. It also expects the range to be normalized in the [0,255] interval (while this is not strictly required, the default values of some internal thresholds are tuned for this case). The image I is converted to the appropriate format by

I = single(rgb2gray(I)) ;

We compute the SIFT frames (keypoints) and descriptors by

[f,d] = sift(I) ;

The matrix f has a column for each frame. A frame is a disk of center f(1:2), scale f(3) and orientation f(4). We visualize a random selection of 50 features by:

perm = randperm(size(f,2)) ; 
sel  = perm(1:50) ;
h1   = plotframe(f(:,sel)) ; 
h2   = plotframe(f(:,sel)) ; 
set(h1,'color','y','linewidth',3) ;
set(h2,'color','k','linewidth',1) ;
Some of the detected SIFT frames.
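
The column layout of f described above can be unpacked explicitly. The following is a minimal sketch in plain MATLAB that does not call sift: the frame values are made up for illustration, and it assumes that the center is stored as x first and then y, and that orientations are in radians.

```matlab
% Hypothetical frames, one per column: [x; y; scale; orientation].
% These values are invented for illustration, not produced by sift.
% Assumption: the center is stored as (x, y) and the angle is in radians.
f = [ 10.5  200.0 ;
      20.0   35.2 ;
       2.0    8.0 ;
       0.0   pi/4 ] ;

centers      = f(1:2,:) ;   % 2 x K matrix of disk centers
scales       = f(3,:) ;     % 1 x K disk scales
orientations = f(4,:) ;     % 1 x K orientations

% Unit vector along each frame's orientation (x and y components).
directions = [cos(orientations) ; sin(orientations)] ;

for k = 1:size(f,2)
  fprintf('frame %d: center (%.1f, %.1f), scale %.1f, angle %.2f rad\n', ...
          k, centers(1,k), centers(2,k), scales(k), orientations(k)) ;
end
```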

We can also overlay the descriptors by

h3 = plotsiftdescriptor(d(:,sel),f(:,sel)) ;  
set(h3,'color','g') ;

Detector parameters

The SIFT detector is controlled mainly by two parameters: the peak threshold and the (non) edge threshold.

The peak threshold filters peaks of the DoG scale space that are too small (in absolute value). For instance, consider the test image obtained as a gradient of Gaussian blobs:

I = double(rand(100,500) <= .005) ;
I = (ones(100,1) * linspace(0,1,500)) .* I ;
I(:,1) = 0 ; I(:,end) = 0 ;
I(1,:) = 0 ; I(end,:) = 0 ;
I = 2*pi*4^2 * imsmooth(I,4) ;
I = single(255 * I) ;
A test image for the peak threshold parameter.

We run the detector with peak threshold x by

f = sift(I, 'PeakTresh', x) ;

obtaining fewer and fewer features as x increases:

Selected frames for varying peak threshold.
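
The effect of the peak threshold can be illustrated without running the detector. The sketch below is plain MATLAB on made-up DoG extremum values; it only mimics the filtering rule described above (discard extrema that are too small in absolute value). Whether the actual comparison is strict or not is an implementation detail not stated here, so the use of >= is an assumption.

```matlab
% Hypothetical DoG extremum values (made up for illustration).
peaks = [-12.0 0.5 3.2 -0.8 25.0 7.5] ;

thresholds = [0 1 5 10 20] ;
counts = zeros(size(thresholds)) ;
for n = 1:numel(thresholds)
  x = thresholds(n) ;
  keep = abs(peaks) >= x ;   % assumed rule: keep extrema at least x in magnitude
  counts(n) = nnz(keep) ;
  fprintf('peak threshold %4.1f keeps %d of %d extrema\n', ...
          x, counts(n), numel(peaks)) ;
end
```

The number of surviving extrema can only decrease as the threshold grows, which matches the behavior shown in the figure.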

Similarly, the edge threshold eliminates peaks of the DoG scale space whose curvature is too small (such peaks yield badly localized frames). For instance, consider the test image:

I = zeros(100,500) ;
for i=[10 20 30 40 50 60 70 80 90]
  I(50-round(i/3):50+round(i/3),i*5) = 1 ;
end
I = 2*pi*8^2 * imsmooth(I,8) ;
I = single(255 * I) ;
A test image for the edge threshold parameter.

We run the detector with edge threshold x by

f = sift(I, 'EdgeTresh', x) ;

obtaining more and more features as x increases:

Selected frames for varying edge threshold.
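
One common formulation of the edge test, taken from Lowe's paper, compares the principal curvatures of the DoG at each extremum through the 2x2 spatial Hessian: a peak is kept when trace(H)^2 / det(H) < (r+1)^2 / r, where r is the edge threshold. The sketch below assumes that this implementation follows the same criterion, which is not stated on this page; the Hessian values are made up for illustration.

```matlab
% Hypothetical spatial Hessians [Dxx Dxy; Dxy Dyy] of the DoG at three
% extrema (values made up for illustration).
H = cat(3, [4 0; 0 4], [5 0; 0 1], [50 0; 0 1]) ;

r = 10 ;                      % edge threshold (example value)
limit = (r + 1)^2 / r ;       % largest accepted curvature ratio score

keep = false(1, size(H,3)) ;
for k = 1:size(H,3)
  Dxx = H(1,1,k) ; Dyy = H(2,2,k) ; Dxy = H(1,2,k) ;
  tr = Dxx + Dyy ;            % sum of the principal curvatures
  dt = Dxx*Dyy - Dxy^2 ;      % product of the principal curvatures
  % Assumed criterion (Lowe): reject saddle points and elongated, edge-like peaks.
  keep(k) = (dt > 0) && (tr^2 / dt < limit) ;
end
```

Under this criterion a larger r raises the limit and admits more elongated peaks, which is consistent with the growing number of features in the figure.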

Custom frames

The MATLAB command sift (and the command line utility) can bypass the detector and run the descriptor on custom frames by means of the Frames option.

Custom frames with computed orientations.

For instance, we can compute the descriptor of a SIFT frame centered at position (100,100), of scale 10 and orientation pi/8 by

fc = [100;100;10;pi/8] ;
[f,d] = sift(I,'frames',fc) ;

Multiple frames fc can be specified as well. In this case they are re-ordered by increasing scale. The Orientations option instructs the program to use the custom position and scale but to compute the keypoint orientations, as in

fc = [100;100;10;0] ;
[f,d] = sift(I,'frames',fc,'orientations') ;

Notice that, depending on the local appearance, a keypoint may have multiple orientations. Moreover, a keypoint computed on a constant image region (such as a one-pixel region) has no orientations!
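
Because multiple custom frames are re-ordered by increasing scale, the output columns need not follow the input order. A minimal sketch in plain MATLAB of sorting the frames beforehand, so that the correspondence between input and output is explicit, is given below. The frame values are made up, and how frames with equal scales are ordered is not specified here.

```matlab
% Hypothetical custom frames [x; y; scale; orientation], one per column.
fc = [100 150  60 ;
      100  80 200 ;
       10   4   7 ;
      pi/8  0 pi/2] ;

[~, order] = sort(fc(3,:)) ;   % indices that sort the scales increasingly
fc_sorted  = fc(:, order) ;    % frames arranged by increasing scale

% order(k) is the input column that ends up in position k.
disp(order) ;
```

Passing fc_sorted instead of fc should then leave the column order unchanged, while order maps the results back to the original frames.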

Conventions

In our implementation SIFT frames are expressed in the standard image reference. The only difference between the command line and MATLAB drivers is that the latter assumes that the image origin (top-left corner) has coordinate (1,1) as opposed to (0,0). Lowe's original implementation uses a different reference system, illustrated next:

Our conventions (top) compared to Lowe's (bottom).

Our implementation uses the standard image reference system, with the y axis pointing downward. The frame orientation θ and descriptor use the same reference system (i.e. a small positive rotation of the x axis moves it towards the y axis). Recall that each descriptor element is a bin indexed by (θ,x,y); the histogram is vectorized in such a way that θ is the fastest varying index and y the slowest.
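
The vectorization order can be made concrete with a small indexing sketch. It assumes the usual SIFT layout of 8 orientation bins and 4 x 4 spatial bins (128 elements), which is not stated on this page, and uses a stand-in descriptor whose entries equal their own 0-based position.

```matlab
NBO = 8 ;   % assumed number of orientation bins
NBP = 4 ;   % assumed number of spatial bins along x and along y

d = (0:NBO*NBP*NBP-1)' ;          % stand-in descriptor: value = 0-based index
D = reshape(d, NBO, NBP, NBP) ;   % D(t+1, i+1, j+1): theta fastest, y slowest

t = 3 ; i = 1 ; j = 2 ;           % 0-based bin indices (theta, x, y)
idx   = t + NBO*i + NBO*NBP*j ;   % 0-based position in the vectorized descriptor
value = D(t+1, i+1, j+1) ;        % same element, read from the 3-D array

fprintf('bin (%d,%d,%d) is element %d of the descriptor\n', t, i, j, idx + 1) ;
```

Since MATLAB stores arrays with the first index varying fastest, reshape(d, 8, 4, 4) recovers the (θ,x,y) histogram directly.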

By comparison, D. Lowe's implementation (see bottom half of the figure) uses a slightly different convention: frame centers are expressed relative to the standard image reference system, but the frame orientation and the descriptor assume that the y axis points upward. Consequently, to map from our convention to D. Lowe's, frame orientations need to be negated and the descriptor elements must be re-arranged.
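
A possible form of this conversion is sketched below in plain MATLAB. It rests on several assumptions that this page does not confirm: the descriptor has 8 orientation bins and 4 x 4 spatial bins, orientation bin t is centered at angle 2πt/8, and flipping the y axis amounts to mirroring the y bins and negating the orientation bin index. The frame and descriptor values are stand-ins; the result should be checked against reference output before being relied upon.

```matlab
NBO = 8 ; NBP = 4 ;               % assumed descriptor layout (8 x 4 x 4 = 128)

f = [100 ; 100 ; 10 ; pi/8] ;     % example frame [x; y; scale; orientation]
d = (1:NBO*NBP*NBP)' ;            % stand-in descriptor (not a real one)

% Frame: the center and scale are unchanged, the orientation is negated.
f_lowe = f ;
f_lowe(4) = -f(4) ;

% Descriptor (assumed re-arrangement): mirror y, negate the theta bin index.
D = reshape(d, NBO, NBP, NBP) ;   % D(theta, x, y), theta fastest
D = D(:, :, end:-1:1) ;           % y bin j -> NBP-1-j (0-based)
D = D([1 NBO:-1:2], :, :) ;       % theta bin t -> mod(-t, NBO) (0-based)
d_lowe = D(:) ;
```

Both steps are their own inverse, so applying the same re-arrangement to d_lowe returns d, and the same code would map in the opposite direction under the stated assumptions.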