This document describes how the SURF, TOP-SURF, SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT and GIST descriptors were extracted from the images in the MIRFLICKR collection.
Preprocessing
To extract the descriptors, we first preprocessed all images with the ImageMagick toolkit, resizing them to 256x256 pixels using the Catrom filter. To resize a single image we issued the following command:
[imagemagick] -filter Catrom -resize 256x256! -path [output] [input]
where [imagemagick] points to ImageMagick's 'mogrify' tool (e.g. /imagemagick/bin/mogrify), [input] points to the input image (e.g. /mirflickr/images/0/0.jpg) and [output] points to the directory where the square output image is to be stored (e.g. /mirflickr/square/0/). Note that the '!' flag forces the exact 256x256 geometry regardless of the original aspect ratio, and that -filter has to be specified before -resize for the Catrom filter to actually be used during resampling.
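Putting this together for the example paths above (a concrete illustration only, assuming the tool and directory locations from the examples), the full command looks as follows:
/imagemagick/bin/mogrify -filter Catrom -resize 256x256! -path /mirflickr/square/0/ /mirflickr/images/0/0.jpg
Since mogrify writes its result into the directory given by -path under the original file name, this produces /mirflickr/square/0/0.jpg.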
________________________________________________________________________________________________________________
Interest points
To detect the interest points used for all of the interest point-based image descriptors, we have used a Fast Hessian-based interest point detector with default settings, namely:
octaves = 4
intervals = 4
sampling = 2
threshold = 0.0004
The Fast Hessian-based detector finds roughly 0-400 interest points per image, each of which is associated with a feature vector as calculated by the descriptors below. We specifically use this interest point detector because other detectors often find many more; for instance, the Hessian-Laplace-based detector typically detects thousands of interest points per image, resulting in a descriptor that can easily take 1MB of space. Representing images with as few, but still sufficiently strong, interest points as possible is important to reduce the required storage space as well as the time needed to compare images with each other.
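To make these settings concrete, here is a minimal C++ sketch of running the detection via the OpenSURF toolkit (introduced below); it assumes the original OpenCV-based OpenSURF release, whose surflib.h exposes the surfDetDes entry point with the parameter order used here:

#include "surflib.h"   // OpenSURF; brings in the OpenCV types it needs
#include <cstdio>
#include <vector>

int main()
{
  // Load one of the preprocessed 256x256 images (example path).
  IplImage *img = cvLoadImage("/mirflickr/square/0/0.jpg");
  if (!img) return 1;

  std::vector<Ipoint> ipts;
  // upright = false (rotation invariant), octaves = 4, intervals = 4,
  // initial sampling step = 2, blob response threshold = 0.0004
  surfDetDes(img, ipts, false, 4, 4, 2, 0.0004f);

  std::printf("detected %d interest points\n", (int)ipts.size());
  cvReleaseImage(&img);
  return 0;
}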
To detect and save the Fast Hessian-based interest points one can use the detector included in the OpenSURF toolkit. So that the exact same points can be reused for each SIFT-based descriptor we compute with the Color Descriptors toolkit, the points have to be saved in the format used by that toolkit. This text format looks as follows:
[file format]
[descriptor size]
[detected interest points]
[interest point 1]
...
[interest point N]
where [file format] should be set to KOEN1, [descriptor size] to 0 because we only report the interest point details and have not calculated a descriptor, [detected interest points] to the number of interest points found in the image, and [interest point n] uses the following format:
<CIRCLE [x] [y] [scale] [orientation] [corneredness]>; [features];
where [x] and [y] are 4-byte integers indicating the (x,y)-coordinates of the interest point as seen from the top-left of the image with starting index 1 (as in MATLAB), [scale], [orientation] and [corneredness] are floating points indicating these particular properties of the interest point, and [features] is a list of numbers that form its feature vector. For the purpose of extracting the SIFT-based descriptors with the Color Descriptors toolkit, saving only x and y is sufficient. However, the Fast Hessian detector also provides the scale and orientation, and it is best to save these as well, since they may come in handy when you are actually comparing the descriptors of two images; the detector does not calculate corneredness though. The [features] part can be left out when saving only the interest points, so you will end up with two semi-colons at the end of each line.
The easiest way to save the interest points of an image to a separate file is to first extract its SURF descriptor (see below), load this descriptor into memory, walk through all of the detected interest points and write each of their details to disk using the format above. This file describing the interest points can then be passed alongside the image to the Color Descriptors toolkit to extract the various SIFT descriptors. To see some example files, download the 'regions' archive from the main page, which contains the detected interest points ('regions') for the images in the collection.
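As an illustration, a minimal C++ sketch of such a writer, assuming OpenSURF's Ipoint structure (members x, y, scale and orientation) and the KOEN1 layout described above:

#include "surflib.h"   // for OpenSURF's Ipoint
#include <cstdio>
#include <vector>

// Write detected interest points in the KOEN1 text format. Note the +1 on
// the coordinates: KOEN1 is 1-based while OpenSURF reports 0-based points.
void writeKoen1(const char *path, const std::vector<Ipoint> &ipts)
{
  std::FILE *f = std::fopen(path, "w");
  if (!f) return;
  // Header: file format, descriptor size (0, points only), point count.
  std::fprintf(f, "KOEN1\n0\n%d\n", (int)ipts.size());
  for (size_t i = 0; i < ipts.size(); ++i) {
    const Ipoint &p = ipts[i];
    // Corneredness is not computed by the Fast Hessian detector, so write 0;
    // leaving out [features] yields the two semi-colons at the end.
    std::fprintf(f, "<CIRCLE %d %d %f %f 0>;;\n",
                 (int)(p.x + 0.5f) + 1, (int)(p.y + 0.5f) + 1,
                 p.scale, p.orientation);
  }
  std::fclose(f);
}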
________________________________________________________________________________________________________________
SURF
The SURF descriptors were extracted using the OpenSURF toolkit. The official version is written in C++, although it has also been ported to other languages like Java and C#. Note that these ports may not match the quality of the original, in terms of their speed and, more importantly, the correctness of the detected points, so we recommend only using the official version. If you don't have a suitable development environment we suggest downloading the Eclipse IDE for C/C++ Developers to compile the toolkit.
We have used the following binary format to save the descriptor of an image:
[detected interest points]
[interest point 1]
...
[interest point N]
where [detected interest points] is a 4-byte integer set to the number of interest points found in the image, and [interest point n] uses the following format:
[x] [y] [scale] [orientation] [laplacian] [descriptor]
where [x] and [y] are 4-byte floating points indicating the (x,y)-coordinates of the interest point as seen from the top-left of the image with starting index 0 (as in C++), [scale] and [orientation] are 4-byte floating points and have the same meaning as indicated before, [laplacian] is a 4-byte integer indicating the sign of the Laplacian, and [descriptor] contains the actual SURF descriptor that consists of 64 4-byte floating points.
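For completeness, a minimal C++ sketch of reading this binary format back into memory; it is a sketch only, assuming the field layout listed above and that the reader runs on a machine with the same byte order as the writer:

#include <cstdio>
#include <vector>

struct SurfPoint {
  float x, y, scale, orientation;
  int laplacian;          // sign of the Laplacian
  float descriptor[64];   // the actual SURF descriptor
};

std::vector<SurfPoint> readSurf(const char *path)
{
  std::vector<SurfPoint> points;
  std::FILE *f = std::fopen(path, "rb");
  if (!f) return points;
  int count = 0;
  if (std::fread(&count, sizeof(int), 1, f) == 1) {
    for (int i = 0; i < count; ++i) {
      SurfPoint p;
      if (std::fread(&p.x, sizeof(float), 1, f) != 1) break;
      std::fread(&p.y, sizeof(float), 1, f);
      std::fread(&p.scale, sizeof(float), 1, f);
      std::fread(&p.orientation, sizeof(float), 1, f);
      std::fread(&p.laplacian, sizeof(int), 1, f);
      std::fread(p.descriptor, sizeof(float), 64, f);
      points.push_back(p);
    }
  }
  std::fclose(f);
  return points;
}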
________________________________________________________________________________________________________________
TOP-SURF
We used the TOP-SURF toolkit to extract this descriptor, which represents an image by the visual words present in it. First each interest point is described using SURF, after which its best matching visual word is looked up in a codebook that contains 200,000 visual words in total. The codebook was created by k-means clustering several million SURF descriptors and associating each visual word with a weight based on its inverse document frequency. The visual words are then aggregated and the top 100 visual words with the highest accumulated weights are retained. The code of the toolkit is open source, written in C++ and available for Windows, Linux and Mac OS X. If you don't have a suitable development environment we suggest downloading the Eclipse IDE for C/C++ Developers to compile the toolkit.
We have used the following binary format to save the descriptor of an image:
[detected visual words]
[descriptor length]
[visual word 1]
...
[visual word N]
where [detected visual words] is a 4-byte integer set to the number of visual words found in the image, [descriptor length] is a 4-byte floating point indicating the length (norm) of the descriptor vector, which is useful for fast cosine distance calculation, and [visual word n] uses the following format:
[identifier] [tf] [idf] [detected locations] [location 1] ... [location M]
where [identifier] is a 4-byte integer indicating the visual word identifier, [tf] is a 4-byte floating point indicating the term frequency of the visual word in the image, and [idf] is a 4-byte floating point indicating the inverse document frequency of the visual word in the codebook; [detected locations] refers to the number of times this visual word was encountered in the image, and [location m] uses the following format:
[x] [y] [scale] [orientation]
where [x], [y], [scale] and [orientation] have the same meaning as in SURF.
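To make the layout concrete, here is a minimal C++ sketch that reads this binary format and uses the stored [descriptor length] for a fast cosine similarity. It is a sketch under several assumptions not stated in the format description above: the byte order matches the writer, [detected locations] is a 4-byte integer like the other counts, the visual words are sorted by identifier, and [descriptor length] is the Euclidean norm of the tf*idf vector:

#include <cstdio>
#include <vector>

struct WordLocation { float x, y, scale, orientation; };

struct VisualWord {
  int identifier;
  float tf, idf;
  std::vector<WordLocation> locations;
};

bool readTopSurf(const char *path, std::vector<VisualWord> &words,
                 float &descriptorLength)
{
  std::FILE *f = std::fopen(path, "rb");
  if (!f) return false;
  int wordCount = 0;
  std::fread(&wordCount, sizeof(int), 1, f);
  std::fread(&descriptorLength, sizeof(float), 1, f);
  for (int i = 0; i < wordCount; ++i) {
    VisualWord w;
    std::fread(&w.identifier, sizeof(int), 1, f);
    std::fread(&w.tf, sizeof(float), 1, f);
    std::fread(&w.idf, sizeof(float), 1, f);
    int locationCount = 0;   // assumed to be a 4-byte integer
    std::fread(&locationCount, sizeof(int), 1, f);
    for (int j = 0; j < locationCount; ++j) {
      WordLocation loc;
      std::fread(&loc, sizeof(float), 4, f);
      w.locations.push_back(loc);
    }
    words.push_back(w);
  }
  std::fclose(f);
  return true;
}

// Cosine similarity between two descriptors: the dot product of the sparse
// tf*idf vectors divided by the two precomputed vector lengths, so the norms
// never have to be recomputed at matching time.
float cosineSimilarity(const std::vector<VisualWord> &a, float lenA,
                       const std::vector<VisualWord> &b, float lenB)
{
  float dot = 0.0f;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {   // merge walk over sorted identifiers
    if (a[i].identifier < b[j].identifier) ++i;
    else if (b[j].identifier < a[i].identifier) ++j;
    else { dot += (a[i].tf * a[i].idf) * (b[j].tf * b[j].idf); ++i; ++j; }
  }
  return dot / (lenA * lenB);
}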
________________________________________________________________________________________________________________
SIFT
The SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT descriptors were extracted using the Color Descriptors toolkit, which is available for Windows and Linux and is run from the command line. The toolkit can extract many other descriptors based on interest points and visual words, so it may prove useful if you are interested in looking beyond the SIFT descriptors.
To extract the various SIFT descriptors, we first saved the interest points detected in each image to files using the KOEN1 format mentioned above. Then, we called the Color Descriptors toolkit as follows:
[toolkit] [input] --loadRegions [interest points] --output [output] --descriptor [descriptor]
where [toolkit] points to the Color Descriptors toolkit (e.g. ./colordescriptors), [input] points to the input image (e.g. /mirflickr/square/0/0.jpg), [interest points] points to the file containing the interest points of the input image (e.g. /mirflickr/regions/0/0.txt), [descriptor] refers to the descriptor to extract (e.g. sift, csift, rgbsift, opponentsift), and [output] points to where the descriptor is to be stored. The output is saved in the human-readable KOEN1 format above, which is efficient from neither a storage nor a matching perspective. We therefore converted the descriptors to a binary format similar to how we saved the SURF descriptors, namely:
[detected interest points]
[interest point 1]
...
[interest point N]
where [detected interest points] is a 4-byte integer set to the number of interest points found in the image, and [interest point n] uses the following format:
[x] [y] [scale] [orientation] [cornerness] [descriptor]
where [x] and [y] are 4-byte floating points indicating the (x,y)-coordinates of the interest point as seen from the top-left of the image with starting index 0 (as in C++), [scale] and [orientation] are 4-byte floating points and have the same meaning as indicated before, [cornerness] is not used and can be set to 0, and [descriptor] contains the actual SIFT descriptor that consists of a varying number of 4-byte floating points; for SIFT this number is 128, whereas for C-SIFT, RGB-SIFT and OPPONENT-SIFT this is 384.
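As an illustration of this conversion, a minimal C++ sketch that parses the CIRCLE lines of a KOEN1 descriptor file and writes the binary layout above; it is a sketch only, assuming well-formed input in the semi-colon-separated format described earlier:

#include <cstdio>

void koen1ToBinary(const char *inPath, const char *outPath)
{
  std::FILE *in = std::fopen(inPath, "r");
  std::FILE *out = std::fopen(outPath, "wb");
  if (!in || !out) return;

  // Header: file format, descriptor size, number of interest points.
  char magic[16];
  int dims = 0, count = 0;
  std::fscanf(in, "%15s %d %d", magic, &dims, &count);
  std::fwrite(&count, sizeof(int), 1, out);

  for (int i = 0; i < count; ++i) {
    float x, y, scale, orientation, cornerness;
    std::fscanf(in, " <CIRCLE %f %f %f %f %f>;",
                &x, &y, &scale, &orientation, &cornerness);
    x -= 1.0f;   // KOEN1 coordinates are 1-based,
    y -= 1.0f;   // the binary format is 0-based
    std::fwrite(&x, sizeof(float), 1, out);
    std::fwrite(&y, sizeof(float), 1, out);
    std::fwrite(&scale, sizeof(float), 1, out);
    std::fwrite(&orientation, sizeof(float), 1, out);
    std::fwrite(&cornerness, sizeof(float), 1, out);
    for (int d = 0; d < dims; ++d) {   // dims = 128 for SIFT, 384 for the rest
      float v = 0.0f;
      std::fscanf(in, " %f", &v);
      std::fwrite(&v, sizeof(float), 1, out);
    }
    std::fscanf(in, " ;");   // consume the trailing semi-colon
  }
  std::fclose(in);
  std::fclose(out);
}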
________________________________________________________________________________________________________________
GIST
The GIST descriptors were extracted using the LabelMe MATLAB toolbox with common settings, namely:
windows = 4
scales = 5
orientations = 6
frequency cycles = 4
The descriptor computes energy over windows at different orientations and scales. After extracting the LabelMe toolbox and adding it to your MATLAB path, you can extract the GIST descriptor by following the example script listed in the 'gist descriptor' section near the bottom of the page from where you downloaded the toolbox. In terms of the parameters listed, remember that the image size we use is 256 and that we use the same number of orientations per scale level; the number of frequency cycles is used during prefiltering of the image and not during the actual calculation of the descriptor.
For the GIST descriptors we extracted we made slight modifications to the code. Instead of converting the color image to grayscale using the simple mean of the red, green and blue channels, we represented it as the luma (Y') channel of the YUV color space using the following MATLAB code:
rgb = imread('example.jpg');
rgb = single(rgb); % convert from uint8 so the weighted sum is computed in floating point
gray = 0.299 * rgb(:,:,1) + 0.587 * rgb(:,:,2) + 0.114 * rgb(:,:,3);
The reason for this change is that this grayscale representation reflects the fact that the human eye is more sensitive to certain wavelengths of light than others, which affects the perceived brightness of a color, while simply averaging the three color channels results in a less natural representation. Note that the three coefficients in the formula are those of ITU-R BT.601, as used by traditional CRT televisions; modern high-definition screens use the slightly different BT.709 coefficients (0.2126, 0.7152 and 0.0722) for the red, green and blue color channels to retain the same perceived brightness.
We also did not use any boundary padding, because the LabelMe toolbox incorporated this change only after we had already extracted all features using their earlier code. In the file LMgist.m of the LabelMe toolbox we therefore changed the following:
param.boundaryExtension = 32 to param.boundaryExtension = 0
As long as you consistently use the GIST descriptors, either those we extracted or those you obtain by running the LabelMe toolbox, they can be compared with each other; just make sure not to mix them. The format of the GIST descriptor is as follows:
[scales] * [orientations] * [windows] * [windows]
where the energy is calculated at all [scales] and [orientations] for each of the blocks obtained by chopping up the image into [windows] by [windows] pieces, and each block at each orientation at each scale is represented by a 4-byte floating point value. To illustrate, for the parameters we used this results in 5 * 6 * 4 * 4 = 480 values per descriptor.
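For completeness, a minimal C++ sketch that loads a stored GIST descriptor and compares two descriptors with the Euclidean distance; it assumes the descriptor is stored as the plain run of 4-byte floating point values described above, with nothing else in the file:

#include <cmath>
#include <cstdio>
#include <vector>

// Read a GIST descriptor: for our parameters 5 * 6 * 4 * 4 = 480 floats.
std::vector<float> readGist(const char *path, size_t dims = 480)
{
  std::vector<float> v(dims, 0.0f);
  std::FILE *f = std::fopen(path, "rb");
  if (f) {
    std::fread(v.data(), sizeof(float), v.size(), f);
    std::fclose(f);
  }
  return v;
}

// Euclidean distance between two GIST descriptors of equal length.
float gistDistance(const std::vector<float> &a, const std::vector<float> &b)
{
  float sum = 0.0f;
  for (size_t i = 0; i < a.size(); ++i) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return std::sqrt(sum);
}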