SIFT

Scale-Invariant Feature Transform (SIFT), as its name implies, is a feature detection and matching algorithm that is invariant to image scale.

Detection of scale space extrema

Scale space

The scale space of an image is defined as the function

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

where

  • $I(x, y)$ is the image
  • $G(x, y, \sigma)$ is a Gaussian of scale $\sigma$

The scale space of an image is obtained by convolving the image with a variable-scale Gaussian.
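
As a minimal sketch of this convolution, the snippet below blurs an image at a single scale using only NumPy. The function names `gaussian_kernel_1d` and `gaussian_blur`, the `truncate` cutoff, and the edge-replicating border handling are illustrative assumptions, not part of SIFT itself.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    """Sampled 1-D Gaussian, normalized so its weights sum to 1.

    `truncate` (an assumption) cuts the kernel off at about 3 sigma.
    """
    radius = max(1, int(truncate * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Approximate the scale space of `image` at one scale `sigma`.

    The 2-D Gaussian is separable, so rows and then columns are filtered
    with the same 1-D kernel. Borders are padded by edge replication.
    """
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    padded = np.pad(np.asarray(image, dtype=float), radius, mode="edge")
    # Filter along rows (axis 1), then along columns (axis 0).
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")
```

Calling `gaussian_blur(image, sigma)` with increasing values of `sigma` gives successively smoother versions of the same image, which is what the scale space collects.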
We use this scale space concept to generate what are called octaves for an image. An octave consists of a fixed number of images, where each successive image is blurred with a Gaussian whose scale is larger than the previous one by a constant factor, say $k$.

Example
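In an octave with base scale $\sigma$, where $k$ is the constant factor between successive scales, the images are blurred at the scales:

$$\sigma,\; k\sigma,\; k^2\sigma,\; \dots$$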

In the next octave, the original image is downsampled (usually by a factor of 2) and the Gaussians are again applied successively.
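
The octave construction described above can be sketched as follows. This is an illustration under stated assumptions: `build_octaves` is a hypothetical helper, the default values for `sigma`, `k`, `images_per_octave` and `num_octaves` are arbitrary, the Gaussian filter is passed in as `blur(image, sigma)`, and downsampling simply keeps every second pixel.

```python
import numpy as np

def build_octaves(image, blur, sigma=1.6, k=2 ** 0.5,
                  images_per_octave=5, num_octaves=3):
    """Return a list of octaves; each octave is a list of blurred images.

    `blur(image, sigma)` is any Gaussian filter supplied by the caller.
    All default parameter values are illustrative assumptions.
    """
    octaves = []
    base = np.asarray(image, dtype=float)
    for _ in range(num_octaves):
        # Scales sigma, k*sigma, k^2*sigma, ... within this octave.
        octave = [blur(base, sigma * k ** i) for i in range(images_per_octave)]
        octaves.append(octave)
        # Next octave: halve the resolution by keeping every second pixel.
        base = base[::2, ::2]
    return octaves
```

Any function with the signature `blur(image, sigma)` can be plugged in; for example, `scipy.ndimage.gaussian_filter` takes the image and the scale as its first two arguments.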

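Difference of Gaussians

Stable keypoints are detected as scale-space extrema of the difference of Gaussians (DoG). The difference of Gaussians is simply the difference between the scale spaces of an image at two nearby scales. For example, writing $L(x, y, \sigma)$ for the scale space at scale $\sigma$ and taking two scales separated by a constant factor $k$:

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$$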
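
In code, the difference of Gaussians for one octave is a subtraction of adjacent blurred images. The sketch below assumes the octave is given as a list of equally sized arrays ordered by increasing scale; the name `difference_of_gaussians` is hypothetical.

```python
import numpy as np

def difference_of_gaussians(octave):
    """Subtract each blurred image from the next one in the octave.

    `octave` is a sequence of n images ordered by increasing scale.
    The result has shape (n - 1, height, width).
    """
    stack = np.stack([np.asarray(img, dtype=float) for img in octave])
    return stack[1:] - stack[:-1]
```

An octave of n blurred images therefore yields n - 1 DoG images.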

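The difference-of-Gaussians function is a close approximation to the scale-normalized Laplacian of Gaussian, $\sigma^2 \nabla^2 G$.
To see this, start from the heat diffusion equation satisfied by the Gaussian:

$$\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G$$

Using the finite difference approximation for the partial derivative:

$$\sigma \nabla^2 G = \frac{\partial G}{\partial \sigma} \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}$$

Therefore,

$$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^2 \nabla^2 G$$

Since $k$ is fixed, the factor $(k - 1)$ is the same at every scale and does not change where the extrema lie.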
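
The approximation can be checked numerically at a single point. The sketch below is only a sanity check: the sample point, $\sigma = 1.6$ and $k = 1.01$ are arbitrary choices, and the function names are hypothetical. It uses the closed-form Laplacian of the 2-D Gaussian, $\nabla^2 G = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^4} G$.

```python
import math

def gaussian(x, y, sigma):
    """2-D Gaussian G(x, y, sigma)."""
    return math.exp(-(x * x + y * y) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)

def laplacian_of_gaussian(x, y, sigma):
    """Closed-form Laplacian: (x^2 + y^2 - 2 sigma^2) / sigma^4 * G."""
    r2 = x * x + y * y
    return (r2 - 2.0 * sigma ** 2) / sigma ** 4 * gaussian(x, y, sigma)

def dog_vs_log(x, y, sigma, k):
    """Return (G(k sigma) - G(sigma), (k - 1) sigma^2 Laplacian of G)."""
    dog = gaussian(x, y, k * sigma) - gaussian(x, y, sigma)
    approx = (k - 1.0) * sigma ** 2 * laplacian_of_gaussian(x, y, sigma)
    return dog, approx

# Arbitrary sample point and parameters, chosen only for illustration.
dog, approx = dog_vs_log(1.0, 0.5, sigma=1.6, k=1.01)
rel_err = abs(dog - approx) / abs(approx)
```

Because the derivation rests on a first-order finite difference, the two values agree more closely as $k$ approaches 1.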

Extrema

Once the scale space has been generated, the DoG is calculated for every adjacent pair of images in an octave. We then find the extrema, both minima and maxima. A pixel is marked as a candidate keypoint if it is an extremum within the 3x3x3 window centered on it. That is, for a pixel to be a candidate, its value must be larger than, or smaller than, all 26 of its neighbors: the 8 surrounding pixels in the same image, the 9 pixels in the image at the next scale, and the 9 pixels in the image at the previous scale.
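
The 3x3x3 test can be sketched directly as a loop over the DoG stack of one octave. This is an unoptimized illustration: `find_extrema` is a hypothetical name, and the strict comparison (ties are not counted as extrema) is an assumption rather than something the description above specifies.

```python
import numpy as np

def find_extrema(dog):
    """Return (scale, row, col) of strict extrema in a 3x3x3 neighborhood.

    `dog` has shape (num_scales, height, width). Border pixels and the
    first and last scales are skipped because they lack a full window.
    """
    dog = np.asarray(dog, dtype=float)
    candidates = []
    n_scales, height, width = dog.shape
    for s in range(1, n_scales - 1):
        for r in range(1, height - 1):
            for c in range(1, width - 1):
                cube = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                center = cube[1, 1, 1]
                # 26 neighbors: drop the center (flat index 13 of 27).
                neighbors = np.delete(cube.ravel(), 13)
                if center > neighbors.max() or center < neighbors.min():
                    candidates.append((s, r, c))
    return candidates
```

Since the first and last DoG images have no scale above or below them, only the inner images of the stack can produce candidates.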