Welcome to the 5th edition of the Scalable Concept Image Annotation Challenge
How can you use large scale noisy data to improve image classification and caption generation?
Since 2010, ImageCLEF has run a scalable image annotation task, to promote research into the annotation of images using noisy web page data. It aims to develop techniques to allow computers to reliably describe images, localize different concepts depicted in the images and generate descriptions of the scenes. The main goal of the challenge is to encourage creative ideas of using web page data to improve image annotation.
Motivation
Every day, users struggle with the ever-increasing quantity of data available to them: trying to find “that” photo they took on holiday last year, the image on Google of their favourite actress or band, or the images from a news article someone mentioned at work. There are a huge number of images that can be cheaply found and gathered from the Internet. However, more valuable is mixed-modality data, for example, web pages containing both images and text. A large amount of information about an image is present on its web page, and vice versa. However, the relationship between the surrounding text and images varies greatly, with much of the text being redundant and/or unrelated. Despite the obvious benefits of using such information in automatic learning, the very weak supervision it provides means that it remains a challenging problem.
Following the 2015 challenge, there are three main subtasks based on noisy web page data containing 500,000 images with their corresponding web pages. The objective of Subtask 1 is to annotate and localize concepts in a web page image by using all the available data from the actual web page. The objective of Subtask 2 is to generate a natural language caption of a web page image, also by exploiting the noisy web page data. The objective of Subtask 3 is to identify, from all the concepts depicted in an image, those that would be referred to by human annotators when writing a caption of the image, i.e. salient concepts. The first two subtasks require participants to apply a system that receives as input an image and its corresponding web page and produces as output a list of concepts present in that image (for Subtask 1) or a caption describing the image (for Subtask 2). Subtask 3 requires participants to develop a system that receives as input labelled bounding boxes of concepts in an image and identifies the bounding boxes that are referred to in the corresponding image caption.
In addition to the three main subtasks, we introduce two new trial teasers. The Text Illustration teaser tests systems that analyse a text document only and select the best illustration for it from a given collection of images. The Geolocation teaser aims at evaluating systems that identify the GPS coordinates of the document’s topic based on its text and image data.
As in last year’s task, external training data such as ImageNet ILSVRC2015 and MSCOCO is allowed and encouraged. However, in contrast to previous years, in this edition participants are expected to produce two sets of related results:
One approach using only the externally trained data;
The second approach using both the external data and the noisy web data of the 500,000 web pages.
Subtask 1: Image Annotation and Localisation
The evaluated systems will label each image with concepts together with their bounding boxes in the image. Models can be trained on both the ImageCLEF data and external training data. Pre-trained ImageNet CNNs can therefore be used, although this will serve as the baseline. We encourage the use of the provided training data and of other resources such as ontologies, word disambiguators, language models, language detectors, spell checkers, and automatic translation systems. The design and development of the systems must emphasize scalability. This year the test set is all 500,000 images, so every image will need to be annotated with concepts and their bounding boxes.
Subtask 2: Natural Language Caption Generation
This subtask is geared towards participants interested in developing systems that generate textual descriptions directly from an image as input, e.g. by using visual detectors to identify concepts and then generating textual descriptions from the detected concepts. Participants are welcome to use their own image analysis methods, for example the output of their image annotation systems developed for Subtask 1. They are also encouraged to augment their training data with the noisy content of the web page.
Subtask 3: Content Selection
This subtask is designed primarily for those interested in the Natural Language Generation aspects of Subtask 2 while avoiding visual processing of images. It concentrates on the content selection phase of generating image descriptions, i.e. which concepts (from all the concepts depicted) should be selected to be mentioned in the corresponding description? A gold standard input (all bounding boxes labelled with concepts) will be provided for each test image, and participants are expected to develop systems that identify the bounding box instances most likely to be mentioned in the corresponding descriptions of the image. As in all subtasks, participants are encouraged to use the noisy web page data as a source for inference.
The Concepts
The concepts this year are the same as last year. They are chosen to be visual objects that are localizable and that are useful for generating textual descriptions of the visual content of images. They include animate objects such as people, dogs and cats, inanimate objects such as houses, cars and balls, and scenes such as city, sea and mountains. The concepts are mined from the texts of our large database of image-webpage pairs. Nouns that are subjects or objects of sentences are extracted and mapped onto WordNet synsets. These are then filtered to 'natural', basic-level categories ('dog' rather than 'yorkshire terrier'), based on the WordNet hierarchy and heuristics from large-scale text corpora. The final list of concepts is manually shortlisted by the organizers such that the concepts are (i) visually concrete and localizable; (ii) suitable for use in image descriptions; and (iii) at a suitable 'every day' level of specificity that is neither too general nor too specific.
Teaser 1: Text Illustration
Teaser 1 is designed to evaluate the performance of methods for text to image matching. Participants are asked to analyse a given text document and find the best illustration for it from a set of all available images. The training set consists of approximately 300,000 documents from the ImageCLEF 2016 corpus. The remaining 200,000 documents will be used for testing and cannot be explored during training. A separate development set of about 3,000 image-webpage pairs is also provided as a validation set for parameter tuning and optimisation purposes. For the training and validation ground truth, the best illustration of a document is assumed to be the one(s) already associated with that document. At test time, participants will be provided with a collection of text documents extracted from a subset of the 200,000 test documents as queries. For each test document, the task is to select the best illustration from the collection of 200,000 images in the test set. Thus, the 200,000 test documents must not be used for training. Any external training data can be used.
Teaser 2: Geolocation
Teaser 2 consists of finding the GPS location of the dominant topic in a document by analysing its text and image. The development set contains approximately 3,000 documents with GPS locations and any additional training data can be used. Again, the test data is the text from a subset of the 200,000 documents of the test set.
Registering for the task and accessing the data
Please register by following the instructions found on the main ImageCLEF 2016 webpage.
Following approval of registration for the task, participants will be given access rights to download the data files. The access details can be found in the ImageCLEF system under Collections -> c_ic16_image_annotation -> Detail.
Experimental Process
A development set with ground truth localised concept labels and sentence descriptions will be provided to participants. The overall performance will be evaluated by asking participants to annotate and localise concepts and/or generate sentence descriptions on the 500,000 web page items.
Submission instructions
The submissions will be received through the ImageCLEF 2016 system. Go to "Runs", then "Submit run", and then select track "ImageCLEF2016:photo-annotation".
Participants will be permitted to submit up to 20 runs. As in last year’s task, external training data such as ImageNet ILSVRC2015 and MSCOCO is allowed and encouraged. However, in contrast to previous years, in this edition participants are expected to produce two sets of related results: one approach using only the externally trained data (maximum 10 runs), and the second approach using both the external data and the noisy web data of the 500,000 web pages (maximum 10 runs).
Given the size of the result files, we would like participants to host their result files (a temporary Google Drive folder would fulfill this requirement) and then share the link to the folder within the submission system. When submitting your results, please upload a file containing a string like "I have run the validator script and it passed fine" (to indicate you have used the validator script) and insert the shared link to your result file in the "Method description" textbox.
Each system run will consist of a single ASCII plain text file. The results of each test set image or document (depending on the task) should be given in separate lines in the text file. The format of the text file is as follows:
[subtask_id] [image_ID/document_ID] [results]
where [subtask_id] are:
1 for Subtask 1
2 for Subtask 2
3 for Subtask 3
4 for Teaser 1
5 for Teaser 2
Subtask 1: Image Annotation and Localisation
The results for each test set image should be given on separate lines, with each line providing up to 100 localised concepts and up to 100 localisations of the same concept. The format uses specific characters to separate the elements: a colon ':' for the confidence, a comma ',' to separate multiple bounding boxes, and 'x' and '+' for the size-offset bounding box format.
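As an illustration only (this is a sketch of how the separators might combine; the concepts, confidences and boxes below are made up, so please confirm the exact field ordering against the example run files and the validator script), a line could look like:
1 -0-QJyJXLD_48kXv dog:0.83 50x80+10+20,40x60+100+30 person:0.65 120x200+5+5
Here 50x80+10+20 denotes a box of width 50 and height 80 offset by (10, 20) from the image origin, and the two boxes for dog are separated by a comma.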
Subtask 2: Natural Language Caption Generation
The format for each line is: 2 [image_ID] [sentence]. For example:
2 -0-QJyJXLD_48kXv Two boys are playing with a dog.
Subtask 3: Content Selection
The format for each line is: 3 [image_ID] [list of selected bounding box IDs separated by commas].
For example, if ten bounding boxes are given as input (IDs 0 to 9), and your system selected bounding boxes 0, 2, and 5:
3 -0-QJyJXLD_48kXv 0,2,5
Teaser 1: Text Illustration
The format for each line is: 4 [document_ID] [list of the top 100 ranked image IDs separated by commas].
The images should be listed in order of confidence (more relevant first). For example, if jlsCv7gwAE1ScDJ2 is the top ranked image, LCbsocZO4I1KtVf1 is at rank 2, and 2WFgg3tHINzdbLfs is at rank 100:
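The corresponding line would then take the form below, with [document_ID] standing for the query document's ID and ranks 3 to 99 elided here for brevity:
4 [document_ID] jlsCv7gwAE1ScDJ2,LCbsocZO4I1KtVf1,...,2WFgg3tHINzdbLfs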
Teaser 2: Geolocation
The format for each line is: 5 [document_ID] [latitude] [longitude]. The latitude should be between -90 and 90 degrees (inclusive), and the longitude should be between -180 and 180 degrees (inclusive). For example:
5 zwtm5ZcmsvTZydW0 51.5085 -6.24889
Verification
A script is available for verifying the correct format of the files. The verification script can be downloaded from the validation folder in the dataset download area and can be used to check the run file for each respective subtask.
A set of example run files is also provided in the validation folder.
Files with the suffix "bad" are examples that do not pass the verification process:
example_run_subtask1_bad.txt doesn’t pass the validation because the third image has 101 BBs for the first concept.
example_run_subtask2_bad.txt contains an image without any description.
example_run_subtask3_bad.txt contains a non-integer as a selected bounding box instance.
example_run_teaser1_bad.txt contains an image not from the 2000 image collection.
example_run_teaser2_bad.txt contains an invalid longitude value (less than -180).
Files without the suffix "bad" should pass the verification process.
Evaluation Methodology
Subtask 1: Image Annotation and Localisation
Localisation for Subtask 1 will be evaluated using the PASCAL VOC style metric of intersection over union (IoU): $$IoU = \frac{area(BB_{fg} \cap BB_{gt})}{area(BB_{fg} \cup BB_{gt})}$$
where \(BB_{fg}\) is the bounding box of a proposed foreground annotation label and \(BB_{gt}\) is the bounding box of the ground truth concept label. The metric is the area of intersection between the proposed localisation and the ground-truth bounding box, divided by the area of their union.
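As a minimal sketch (the function and box values are illustrative, not part of the task definition), the overlap criterion can be computed in Python for boxes given in the width x height + x-offset + y-offset format used in the run files:

def iou(box_a, box_b):
    # Each box is (width, height, x_offset, y_offset).
    wa, ha, xa, ya = box_a
    wb, hb, xb, yb = box_b
    # Width and height of the intersection rectangle (zero if the boxes do not overlap).
    iw = max(0, min(xa + wa, xb + wb) - max(xa, xb))
    ih = max(0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

# A detection counts as correct at the 50% overlap threshold if iou(...) >= 0.5.
print(iou((50, 80, 10, 20), (60, 70, 15, 25)) >= 0.5)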
Subtask 2: Natural Language Caption Generation
Subtask 2 will be evaluated using the Meteor evaluation metric against a minimum of five human-authored textual descriptions as the gold standard reference.
Subtask 3: Content Selection
Subtask 3 will be evaluated using the content selection metric, which is the \(F_1\) score averaged across all test images, where each \(F_1\) score is computed from the precision and recall averaged over all gold standard descriptions for the image. Please refer to Wang & Gaizauskas (2015) or Gilbert et al. (2015) for more details about the metrics.
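As a rough sketch of the metric's structure as described above (assuming, purely for illustration, that each gold standard description is represented by the set of bounding box IDs it mentions):

def content_selection_score(selected, gold_sets):
    # selected: set of bounding box IDs chosen by the system for one image.
    # gold_sets: list of sets, each holding the box IDs mentioned in one gold description.
    precisions, recalls = [], []
    for gold in gold_sets:
        true_positives = len(selected & gold)
        precisions.append(true_positives / len(selected) if selected else 0.0)
        recalls.append(true_positives / len(gold) if gold else 0.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

The final Subtask 3 score is this per-image \(F_1\) averaged over all test images.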
Teaser 1: Text Illustration
For Teaser 1, the test images are ranked according to their distance to the query article. Recall at the \(k\)-th rank position (R@K) of the ground truth image will be used as performance metrics. Several values of \(k\) will be tested, and participants are asked to submit the top 100 ranked images. Please refer to Hodosh et al. (2013) for more details about the metrics.
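In other words, R@K measures the fraction of test documents for which the ground-truth image appears among the top \(k\) returned images. A minimal sketch (the names are illustrative):

def recall_at_k(ranked_lists, ground_truth, k):
    # ranked_lists: dict mapping document_ID to its ranked list of image IDs.
    # ground_truth: dict mapping document_ID to its single ground-truth image ID.
    hits = sum(1 for doc, ranking in ranked_lists.items() if ground_truth[doc] in ranking[:k])
    return hits / len(ranked_lists)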
Teaser 2: Geolocation
For Teaser 2, we will use the great circle distance (GCD) as the performance metric, which is defined as the shortest distance on the surface of a sphere, measured along the surface.
More specifically, let \(G=(g_1, g_2)\) be the pair of ground truth latitude and longitude of an article measured in radians, \(P=(p_1, p_2)\) be the pair of predicted latitude and longitude, and \(R = 6371\,km\) be the radius of the earth. We use the spherical law of cosines approximation: $$GCD = R \times \arccos(\sin(g_1) \times \sin(p_1) + \cos(g_1) \times \cos(p_1) \times \cos(d_2))$$
where \(d_2 = p_2 - g_2\).
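A small sketch of this computation (assuming, as in the run file format, that coordinates are supplied in degrees and therefore converted to radians first):

import math

def great_circle_distance(gt_lat, gt_lon, pred_lat, pred_lon, radius_km=6371.0):
    # Great circle distance via the spherical law of cosines; inputs in degrees.
    g1, g2, p1, p2 = map(math.radians, (gt_lat, gt_lon, pred_lat, pred_lon))
    d2 = p2 - g2
    # Clamp to [-1, 1] to guard against floating point rounding outside the domain of acos.
    c = max(-1.0, min(1.0, math.sin(g1) * math.sin(p1) + math.cos(g1) * math.cos(p1) * math.cos(d2)))
    return radius_km * math.acos(c)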
Results
Results for all subtasks are contained within this folder.
Subtask 1: Image Annotation and Localisation
MAP_0.5Overlap: the localised mean average precision (MAP) for each submitted method, using a performance measure of 50% overlap with the ground truth.
MAP_0_Overlap: the image annotation MAP for each method, where a result counts as a success if the concept is simply detected in the image without any localisation.
MAP_IncreasingMAP: the localisation accuracy MAP using an increasing threshold of detection overlap with the ground truth.
MAP_IncreasingMAP_SingleBestsubmission: the localisation accuracy MAP using an increasing threshold of detection overlap with the ground truth, showing only the best result from each group.
PerConceptMAP_BBoxOverlap_0.5: the per-concept localisation accuracy given a 50% detection overlap with the ground truth labels.
PerConceptMAP_BBoxOverlap_0: the per-concept accuracy at the image-wise level with no localisation.
Subtask 2: Natural Language Caption Generation
subtask2_results: The average Meteor score across all test images. Also provided are the median, min and max scores.
Subtask 3: Content Selection
subtask3_results: The average F1 score, Precision and Recall across all 450 test images.
Teaser 1: Text Illustration
teaser1_results: Results are contained in the PDF file.
Submitting a working notes paper to CLEF
The next step is to produce and submit a working notes paper describing the method and system you evaluated. It is important to produce a paper even if your results are lower than expected, as it is just as valuable to publish methods that do not work well as methods that are successful. Authors are invited to submit using the LNCS proceedings format:
The CLEF 2016 working notes will be published in the CEUR-WS.org proceedings, facilitating the indexing by DBLP and the assignment of an individual DOI (URN) to each paper. According to the CEUR-WS policies, a light review of the working notes will be conducted by the task organizers to ensure quality.
Working notes will have to be submitted before 25th May 2016 11:59 pm - midnight - Central European Summer Time, through the easychair system. The working notes papers are technical reports written in English, describing the participating systems and the conducted experiments. To avoid redundancy, the papers should *not* include a detailed description of the actual task, data set and experimentation protocol. Instead, the papers are required to cite both the general ImageCLEF overview paper and the corresponding image annotation task overview paper. Bibtex references are available below. A general structure for the paper should provide at a minimum the following information:
Title
Authors
Affiliations
Email addresses of all authors
The body of the text. This should contain information on:
tasks performed
main objectives of experiments
approach(es) used and progress beyond state-of-the-art
resources employed
results obtained
analysis of the results
perspectives for future work
The paper should not exceed 12 pages, and further instructions on how to write and submit your working notes are on the following page:
28.05.2016 11:59 pm - midnight - Central European Summer Time: deadline for submission of working notes papers by the participants
17.06.2016: notification of acceptance of the working notes papers
01.07.2016: camera ready working notes papers
05.-08.09.2016: CLEF 2016, Évora, Portugal
Bibtex references
General Overview paper
@incollection{Villegas16_CLEF,
title={{General Overview of ImageCLEF at the CLEF 2016 Labs}},
author={Mauricio Villegas and
M\"uller, Henning and
Garc\'ia Seco de Herrera, Alba and
Schaer, Roger and
Bromuri, Stefano and
Andrew Gilbert and
Luca Piras and
Josiah Wang and
Fei Yan and
Arnau Ramisa and
Emmanuel Dellandrea and
Robert Gaizauskas and
Krystian Mikolajczyk and
Joan Puigcerver and
Alejandro H. Toselli and
Joan-Andreu S\'anchez and
Enrique Vidal},
booktitle = {},
year = {2016},
publisher = {Springer International Publishing},
series = {Lecture Notes in Computer Science},
issn = {0302-9743},
}
Task Overview paper
@inproceedings{Gilbert16_CLEF,
author = {Andrew Gilbert and
Luca Piras and
Josiah Wang and
Fei Yan and
Arnau Ramisa and
Emmanuel Dellandrea and
Robert Gaizauskas and
Mauricio Villegas and
Krystian Mikolajczyk},
title = {{Overview of the ImageCLEF 2016 Scalable Concept Image Annotation Task}},
booktitle = {CLEF2016 Working Notes},
series = {{CEUR} Workshop Proceedings},
year = {2016},
volume = {},
publisher = {CEUR-WS.org },
pages = {},
month = {September 5-8},
address = {\'Evora, Portugal},
}
The Scalable Concept Image Annotation Challenge is co-organized by the VisualSense (ViSen) consortium under the EU ERA-NET CHIST-ERA From Data to New Knowledge (D2K) 2011 Programme, jointly supported by UK EPSRC Grants EP/K01904X/1 & EP/K019082/1, French ANR Grant ANR-12-CHRI-0002-04 and Spanish MINECO Grant PCIN-2013-047.