Scale Invariant Feature Transform(SIFT), like its name implies, is a feature detection and matching algorithm that is scale invariant.
The scale space of an image is defined as the function
Where
The scale space of an image is obtained by convolving a variable scaled gaussian with the image.
We use this scale space concept to generate what are called octaves for an image. An octave consists of a fixed number of images, where each successive image is convolved with a gaussian with a scale difference of, say
Example
In an octave:
In the next octave, the original image is downsampled (usually by an order of 2) and the Gaussians are again applied successively.
Stable keypoints are detected using scale space extrema, with difference of gaussians. The difference of gaussians is just the difference in scale spaces of an image at different scales. For example, the scales here are seperated by a factor of
This difference functions is a close approximation to the normalized Laplacian of Gaussian
We can see the above sentence here:
Using the finite difference approximation for the parital derivative:
Therefore;
Once the scale space has been generated, the DoG is calculated for every adjacent pair of images in an octave. Now, we find the extrema - both minima and maxima. A pixel is marked as a candidate if it is the extrema in a 3x3x3 window around it. To further clarify, for a pixel to be a candidate, it has to be an extrema in 8 pixels around it in the same image, 9 pixels above it in the next scale, and 9 pixels below it in the previous scale.