| Introduction |
| ImageCLEF's Wikipedia Retrieval task provides a testbed for the system-oriented evaluation of visual information retrieval from a collection of Wikipedia images. The aim is to investigate retrieval approaches in the context of a large and heterogeneous collection of images (similar to those encountered on the Web) that are searched by users with diverse information needs. In 2010, ImageCLEF's Wikipedia Retrieval used a new collection of over 237,000 Wikipedia images that cover diverse topics of interest. These images are associated with unstructured and noisy textual annotations in English, French, and German. This is an ad-hoc image retrieval task; the evaluation scenario is therefore similar to the classic TREC ad-hoc retrieval task and the ImageCLEF photo retrieval task: it simulates the situation in which a system knows the set of documents to be searched but cannot anticipate the particular topic that will be investigated (i.e. topics are not known to the system in advance). The goal of the simulation is: given a textual query (and/or sample images) describing a user's (multimedia) information need, find as many relevant images as possible from the Wikipedia image collection. Any method can be used to retrieve relevant documents. We encourage the use of both concept-based and content-based retrieval methods and, in particular, multimodal and (new this year) multilingual approaches that investigate the combination of evidence from different modalities and language resources. |
| ImageCLEF 2010 Wikipedia Collection |
|
The ImageCLEF 2010 Wikipedia collection consists of 237,434 images and associated user-supplied annotations. The collection was built to cover similar topics in English, German and French. Topical similarity was obtained by selecting only Wikipedia articles that have versions in all three languages and are illustrated with at least one image in each version: 44,664 such articles were extracted from the September 2009 Wikipedia dumps, containing a total of 265,987 images. Since the collection is intended to be freely distributed, we decided to remove all images with unclear copyright status. After this operation, duplicate elimination and some additional cleaning up, 237,434 images remain in the collection, with the following language distribution:

-English only: 70,127

The main difference between the ImageCLEF 2010 Wikipedia collection and the INEX MM collection (Westerveld and van Zwol, 2007) used in the previous WikipediaMM tasks is that the multilingual aspect has been reinforced, so that both mono- and cross-lingual evaluations can be carried out. Another difference is that this year participants will receive, for each image, both its user-provided annotation and links to the article(s) that contain the image. Finally, in order to encourage multimodal approaches, three types of low-level image features were extracted using PIRIA, CEA LIST's image indexing tool (Joint et al., 2004), and are provided to all participants.

(Joint et al., 2004) M. Joint, P.-A. Moëllic, P. Hède, P. Adam. PIRIA: a general tool for indexing, search and retrieval of multimedia content. In Proceedings of SPIE, 2004. |
|
Two examples that illustrate the images in the collection and their metadata are provided below:
|
|
DOWNLOAD: The data are no longer available here; they can be downloaded from the Resources page for the ImageCLEF Wikipedia Image Retrieval Datasets: HERE.

Search Engine: The Cross-Modal Search Engine (CMSE, by UniGe) lets you search the ImageCLEF 2010 Wikipedia image collection through a web interface using text queries, example images, or both at once. |
| Evaluation Objectives |
The characteristics of the new Wikipedia collection allow for the investigation of the following objectives:
In the context of INEX MM 2006-2007, mainly text-based retrieval approaches were examined. Here, we hope to attract more visually oriented approaches and, most importantly, multimodal and multilingual approaches that investigate the combination of evidence from different modalities and languages. The results of WikipediaMM at ImageCLEF 2008/2009 showed that multimedia retrieval approaches outperformed text-based approaches for certain topics, but that overall text-based retrieval remained unbeaten. The retrieval of multimedia documents will remain the focus of attention for 2010. This year, a second focus will be the effectiveness of multilingual approaches for multimedia document retrieval. |
| Topics |
|
The topics for the ImageCLEF 2010 Wikipedia Retrieval task include (i) topics based on the analysis of a search engine's logs, and (ii) topics used in previous years.

DOWNLOAD (participants only)
The topics are multimedia queries that can consist of a textual and a visual part. Concepts that might be needed to constrain the results should be added to the title field. An example topic in the appropriate format is the following: <topic> </topic> Therefore, the topics include the following fields:
|
| Retrieval Experiments |
|
Experiments are performed as follows: the participants are given topics, from which they create queries that are used to perform retrieval on the image collection. This process iterates (e.g. possibly involving relevance feedback) until they are satisfied with their runs. Participants might try different methods to increase the number of relevant images in the top N rank positions (e.g. query expansion). Participants are free to experiment with whatever image retrieval methods they wish, e.g. query expansion based on thesaurus lookup or relevance feedback, indexing and retrieval on only part of the image caption, different retrieval models, and combinations of text-based and content-based methods. Given the many possible approaches to ad-hoc retrieval, rather than listing them all we ask participants to indicate which of the following applies to each of their runs (we consider these the "main" dimensions which define the query for this ad-hoc task):
-Annotation language:
-Comment:
-Topic language:
-Run type:
-Feedback or Query Expansion:
-Retrieval type (Modality):
-Topic field: |
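As an illustration of what "combining text and content-based methods" can look like in practice, the following is a minimal sketch of weighted late fusion of per-image scores. It is only one possible approach and is not prescribed by the task; the function names `min_max` and `fuse`, the weight `alpha`, and the choice of min-max normalisation are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so that text and visual scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: treat every image as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(text_scores: Dict[str, float],
         visual_scores: Dict[str, float],
         alpha: float = 0.7) -> List[Tuple[str, float]]:
    """Linear late fusion (hypothetical helper, not part of the task).

    alpha weights the text run and (1 - alpha) the visual run; an image
    missing from one run contributes 0 for that modality.
    """
    t, v = min_max(text_scores), min_max(visual_scores)
    fused = {doc: alpha * t.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
             for doc in set(t) | set(v)}
    # Highest fused score first; ties are broken by image identifier.
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))
```

A call such as `fuse(text_run, visual_run, alpha=0.5)` returns a single ranked list for one topic; the weight would normally be tuned on topics from previous years rather than fixed in advance.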
| Submissions |
|
Participants can submit up to 20 system runs. The submission system is now open at the ImageCLEF registration system (select Runs > Submit a Run). Participants are required to submit ranked lists of (up to) the top 1000 images, ranked in descending order of similarity (i.e. the most similar image at the top of the list). The format of submissions for this ad-hoc task is the TREC format. It can be found here. Please note that there should be at least one document entry in your results for each topic (i.e. if your system returns no results for a query, insert a dummy entry, e.g. 25 1 16019 0 4238 xyzT10af5 ). The reason for this is to make sure that all systems are compared on the same number of topics and relevant documents. Submissions not following the required format will not be evaluated.

Information to be provided during submission
|
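The submission rules above (at most 1000 images per topic, descending order of similarity, and a dummy entry for topics without results) can be sketched as follows. This is a minimal example, not an official tool: it assumes that the six columns of the dummy entry are topic number, a constant 1, image identifier, rank (starting at 0), score, and run tag, which follows the usual TREC convention but should be checked against the linked format description. The function name `format_run`, its parameters, and the reuse of image 16019 as the dummy identifier are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple

MAX_RESULTS = 1000  # the task accepts at most the top 1000 images per topic


def format_run(results: Dict[int, Sequence[Tuple[str, float]]],
               topic_ids: Sequence[int],
               run_tag: str,
               dummy_doc: str = "16019") -> List[str]:
    """Return the lines of a run file (hypothetical helper).

    results maps a topic number to (image_id, score) pairs; every topic in
    topic_ids gets at least one line, even if the system found nothing.
    """
    lines = []
    for topic in topic_ids:
        # Sort by descending similarity and keep only the top 1000 images.
        ranked = sorted(results.get(topic, []),
                        key=lambda pair: pair[1], reverse=True)[:MAX_RESULTS]
        if not ranked:
            # No results for this topic: insert a single dummy entry.
            ranked = [(dummy_doc, 0.0)]
        for rank, (image_id, score) in enumerate(ranked):
            lines.append(f"{topic} 1 {image_id} {rank} {score:.4f} {run_tag}")
    return lines
```

The returned lines can be written to a file with `"\n".join(lines)`. Passing the complete list of topic numbers as `topic_ids` is what guarantees that no topic is silently missing from the run.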
| Notebook papers |
|
The full Notebook papers are posted on the CLEF 2010 website. The printed Book of Extended Abstracts was distributed at the CLEF 2010 conference. Both the Notebook papers and the Book of Abstracts were assigned an ISBN (978-88-904810-0-0). The papers are also indexed by DBLP. |
| Schedule |
The schedule can be found here:
|
| Organisers |
|