Automatic recognition of photographic content is useful in a wide range of domains, from specialized applications, such as medical imagery, to large public applications, such as web content structuring and retrieval. Although considerable research effort has been devoted to concept and event detection in public and private images, the task remains difficult: the number of possible elements that can be depicted is boundless, and their visual appearance can furthermore vary along numerous dimensions.
ImageCLEF's Photo Annotation and Retrieval task aims to advance the state of the art in multimedia research by providing a challenging benchmark for visual concept detection, annotation and retrieval in the context of diverse collections of photos. The benchmark consists of two subtasks. The objective of the first is to accurately detect a wide range of semantic concepts for the purpose of scalable automatic image annotation on a large collection of web images, whereas the objective of the second is to correctly retrieve relevant images from personal photo collections, based on typical scenarios in which users want to find some of their own photos according to certain criteria.
Subtask 1: Scalable Concept Image Annotation
Image concept detection has generally relied on training data that has been manually, and thus reliably, annotated, an expensive and laborious endeavor that does not easily scale. To address this issue, this year's annotation task will concentrate exclusively on developing annotation systems that rely only on automatically obtained web data. A very large number of images can be cheaply gathered from the web, and furthermore, text associated with the images can be obtained from the webpages that contain them. However, the degree of relationship between the surrounding text and the image varies greatly. Figures 1 and 2 show some image examples retrieved from a search engine for a couple of queries, and it can be observed that some images have no apparent relationship with the intended concept. Moreover, the webpages can be in any language, or even a mixture of languages, and they tend to contain many writing mistakes. Overall, the data can be considered very noisy.
Figure 1. Images from a web search query of "rainbow".
Figure 2. Images from a web search query of "sun".
The goal of this subtask is to evaluate different strategies to deal with the noisy data so that it can be reliably used for annotating images from practically any topic.
In this subtask, the objective is to develop systems that can easily change or scale the list of concepts used for image annotation. In other words, the list of concepts is itself an input to the system. Given an input image and a list of concepts, the system must assign a score to each concept in the list and decide how many of them to assign as annotations. To assess this scalability, the list of concepts will differ between the development and test sets.
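The expected system interface can be illustrated with a minimal sketch. This is not the task's evaluation protocol, only an assumed toy setup in which each image is represented by the tokens of its surrounding webpage text, and each concept is scored by cosine similarity against that text; the function names and the threshold are hypothetical choices for illustration.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    num = sum(v * b.get(t, 0) for t, v in a.items())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def annotate(image_text_tokens, concepts, threshold=0.1):
    """Score every concept in the (input) concept list against the
    image's associated text, then keep the concepts whose score
    exceeds a threshold as the image's annotations."""
    img = Counter(image_text_tokens)
    scores = {c: cosine(Counter(c.split()), img) for c in concepts}
    labels = [c for c, s in scores.items() if s >= threshold]
    return scores, labels
```

Because the concept list is a parameter, the same system can be run unchanged on the development and test sets even though their concept lists differ.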
As described earlier, the training set does not include manually annotated concepts, only textual features obtained from the webpages in which the images appeared. It is not permitted to use any manually labeled data for training the systems, although one possible strategy is to use the textual features to decide which concepts are present and thereby artificially label the provided training data. On the other hand, the use of additional language resources, such as language models, language detectors, stemmers, and WordNet, is permitted and encouraged.
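The "artificially label the training data" strategy mentioned above can be sketched as a simple pseudo-labeling rule: a concept is tentatively assigned to a training image whenever the concept's name occurs in the webpage text surrounding that image. This is an assumed minimal example, not a prescribed method; real systems would likely add stemming, language detection, or WordNet expansion as the text suggests.

```python
import re

def pseudo_label(webpage_text, concept_list):
    """Assign noisy pseudo-labels to a web-crawled training image:
    a concept is assigned if its lowercased name appears as a token
    in the text that surrounded the image on its webpage."""
    tokens = set(re.findall(r"[a-z]+", webpage_text.lower()))
    return [c for c in concept_list if c.lower() in tokens]
```

Labels obtained this way are noisy by construction (as Figures 1 and 2 illustrate), so downstream training would need to tolerate or filter the mislabeled examples.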
For further details on this subtask, see the task's website.
Subtask 2: Personal Photo Retrieval
Although obtaining large amounts of images from the web has become very easy, the actual contents of personal photo collections remain a blind spot of research. Personal photo collections contain many duplicate images and vary in quality and image content. To address this issue, this year's retrieval task will concentrate on retrieval from such a collection, which has been sampled from real photographers with a demographic span that models a lifetime's photo collection. Besides handling the noisy content of a personal photo collection, an objective of this task is to find out whether the participating retrieval systems can exploit data from different search strategies, i.e., query-by-example and browsing data, in order to find both visual concepts (see Fig. 3) and photos depicting events (see Fig. 4). This distinction is motivated by a user study conducted alongside the collection of the personal photos.
Figure 3. Samples of the Visual Concept "Asian Temple Interior".
Figure 4. Samples of the Event Class "Rock Concert".
This year's subtask will extend the pilot task of 2012 with a focus on different usage scenarios and user groups. That is, the subtask will reveal whether the tested algorithms are stable in terms of retrieval quality across different user groups. This is possible because each image's relevance has been judged by multiple assessors on a graded scale.
The subtask will be ad hoc, i.e., no additional training data will be released. Participants will receive multiple query-by-example (QBE) documents and/or browsing data, and will have to find the best-matching documents illustrating an event or depicting a visual concept.
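Retrieval with multiple QBE documents can be sketched as follows. This is an assumed toy formulation, not the task's protocol: each document is a feature vector, and collection items are ranked by their distance to the nearest query example (one simple way of fusing several QBE queries; averaging over queries would be an equally plausible choice).

```python
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_examples(queries, collection):
    """Rank collection items by their smallest distance to any of
    the query-by-example documents (nearest-query fusion).
    Returns collection indices, best match first."""
    scored = [(min(euclid(q, doc) for q in queries), i)
              for i, doc in enumerate(collection)]
    return [i for _, i in sorted(scored)]
```

Since no training data is released, such an ad hoc system relies entirely on the query examples provided at retrieval time.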
For details about the subtask, see the task's website.
- Bart Thomee, Yahoo! Research, Barcelona, Spain, bthomee[at]yahoo-inc.com
- Mauricio Villegas, PRHLT, Universidad Politécnica de Valencia, Spain, mauvilsa[at]upv.es
- Roberto Paredes, PRHLT, Universidad Politécnica de Valencia, Spain, rparedes[at]dsic.upv.es
- David Zellhöfer, Brandenburg University of Technology, Germany, david.zellhoefer[at]tu-cottbus.de