ImageCLEF Wikipedia Image Retrieval Datasets

1. Introduction

The Wikipedia image retrieval task is an ad-hoc image retrieval task. Its overall goal is to investigate how well multi-modal image retrieval approaches, which combine textual and visual evidence to satisfy a user’s multimedia information need, cope with larger-scale image collections whose items are highly heterogeneous in both their textual descriptions and their visual content. The aim is to simulate image retrieval in a realistic setting, such as the Web, where the available images cover highly diverse subjects and have highly varied visual properties, while their accompanying textual metadata (if any) are user-generated, noisy, and unstructured descriptions of varying quality and length.

The Wikipedia image retrieval task ran as part of ImageCLEF for four years: 2008-2011.

2. Datasets

Two collections of Wikipedia images were used during the four years of the task: the Wikipedia INEX Multimedia Collection, consisting of 151,519 images, in 2008 and 2009, and the Wikipedia Retrieval 2010 Collection, consisting of 237,434 images, in 2010 and 2011. Topics were developed each year to cover diverse multimedia information needs: 75 in 2008, 45 in 2009, 70 in 2010, and 50 in 2011. The ground truth for these topics assumes binary relevance (relevant vs. non-relevant) and was created by assessing only the pooled images, i.e., for each topic, the top-ranked images retrieved by the runs that participants submitted that year; a pool depth of 100 was used in 2008, 2010, and 2011, and a pool depth of 50 in 2009.
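
The pooling step described above can be illustrated with a minimal sketch. It assumes each submitted run is available as a mapping from a topic identifier to a ranked list of image identifiers; the function name build_pools, the run variables, and the identifiers are hypothetical and are not taken from the task's actual tooling or file formats.

```python
from typing import Dict, List, Set

# Hypothetical in-memory run representation: topic id -> ranked image ids.
Run = Dict[str, List[str]]


def build_pools(runs: List[Run], depth: int) -> Dict[str, Set[str]]:
    """Return, per topic, the union of the top-`depth` images of every run."""
    pools: Dict[str, Set[str]] = {}
    for run in runs:
        for topic_id, ranking in run.items():
            # Only the first `depth` results of each run enter the pool.
            pools.setdefault(topic_id, set()).update(ranking[:depth])
    return pools


# Toy example with two runs and a pool depth of 2 (the task used 100 or 50).
run_a: Run = {"t1": ["img3", "img1", "img7"]}
run_b: Run = {"t1": ["img1", "img9", "img2"], "t2": ["img5"]}

pools = build_pools([run_a, run_b], depth=2)
print(sorted(pools["t1"]))  # ['img1', 'img3', 'img9']
```

Only the images in each topic's pool would then be shown to assessors for a binary relevance judgment; images outside the pool are never assessed.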

3. How to acquire the datasets

To obtain access to the ImageCLEF Wikipedia Image Retrieval datasets, please follow these steps:

Register with the dataset management system (funded by CHORUS+).
Select the dataset you would like to access. Currently, only the ImageCLEF Wikipedia Image Retrieval 2010-2011 dataset is available.
You will then be able to see the access details for downloading the data in the detailed view of the dataset in the system. Use these access details to download the datasets provided below.
By downloading the ImageCLEF Wikipedia Image Retrieval 2010-2011 dataset, you (the END USER) agree to this.

For any other inquiries, send an email to Theodora Tsikrika, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland, theodora.tsikrika@acm.org

4. Downloading the datasets

Test collection 2008-2009 (Not available yet. To be provided soon.)
- Wikipedia INEX Multimedia Collection: 151,519 images + user-generated textual annotations
- 2008: 75 topics + ground truth
- 2009: 45 topics + ground truth
Test collection 2010-2011
- Wikipedia Retrieval 2010 Collection: This collection consists of 237,434 images, their associated user-generated textual annotations (i.e., the images' textual descriptions extracted from the Wikimedia Commons files and the images' captions in the Wikipedia article(s) that contain them), and the Wikipedia articles containing the images.
  The collection was built to cover similar topics in English, German and French, with the following language distribution for their associated textual annotations:
  - English only: 70,127
  - German only: 50,291
  - French only: 28,461
  - English and German: 26,880
  - English and French: 20,747
  - German and French: 9,646
  - English, German and French: 22,899
  - Language undetermined: 8,144
  - No textual annotation: 239
  
  An example that illustrates an image in the collection and its associated textual annotations is provided below:
  - A README file describing the provided data can be downloaded: HERE.
  - The ImageCLEF 2010 Wikipedia image collection (237,434 .jpeg and .png images - 21GB) can be downloaded in small batches: HERE.
  - A .zip file containing the user-generated textual annotations (metadata) of the images in the collection, the Wikipedia articles containing these images, and an id.txt file listing all image identifiers can be downloaded HERE.
  - Low-level visual features:
    - The CIME, TLEP, and SURF features of the images in the collection can be downloaded: HERE (provided by CEA LIST, France).
    - The CEDD features of the images in the collection can be downloaded: HERE (provided by the Database & Information Retrieval Unit, Department of Electrical and Computer Engineering, Democritus University of Thrace, Greece).
- Topics and ground truth: The topics are descriptions of multimedia information needs that contain textual and visual hints. There were 70 topics in 2010 and 50 topics in 2011.
  - 2010 topics: 70 topics + ground truth (Not available yet. To be provided soon.)
  - 2011 topics:
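
As a consistency check, the language distribution listed above for the Wikipedia Retrieval 2010 Collection can be summed and compared with the stated collection size of 237,434 images. The sketch below does this and also derives how many images have an annotation in each language, alone or combined with others; the language codes and variable names are illustrative choices, while the counts are copied from the list above.

```python
# Counts of images per combination of annotation languages (from the list above).
LANGUAGE_COUNTS = {
    ("en",): 70_127,
    ("de",): 50_291,
    ("fr",): 28_461,
    ("en", "de"): 26_880,
    ("en", "fr"): 20_747,
    ("de", "fr"): 9_646,
    ("en", "de", "fr"): 22_899,
}
UNDETERMINED = 8_144   # annotation present, language undetermined
NO_ANNOTATION = 239    # no textual annotation at all
COLLECTION_SIZE = 237_434

# The nine categories are disjoint, so they should add up to the collection size.
total = sum(LANGUAGE_COUNTS.values()) + UNDETERMINED + NO_ANNOTATION
assert total == COLLECTION_SIZE

# Images annotated in a given language, alone or together with other languages.
per_language = {
    lang: sum(n for langs, n in LANGUAGE_COUNTS.items() if lang in langs)
    for lang in ("en", "de", "fr")
}
print(per_language)  # {'en': 140653, 'de': 109716, 'fr': 81753}
```

The categories do add up to 237,434, so the list partitions the collection without gaps or overlaps.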

5. Related Publications

Theodora Tsikrika, Adrian Popescu, and Jana Kludas. Overview of the Wikipedia Image Retrieval task at ImageCLEF 2011. In the Working Notes for the CLEF 2011 Labs and Workshop, 19-22 September, Amsterdam, The Netherlands, 2011.
Adrian Popescu, Theodora Tsikrika, and Jana Kludas. Overview of the wikipedia retrieval task at ImageCLEF 2010. In the Working Notes for the CLEF 2010 Workshop, 20-23 September, Padova, Italy, 2010.
Theodora Tsikrika and Jana Kludas. The Wikipedia Image Retrieval Task. In: ImageCLEF - Experimental evaluation in visual information retrieval. The Information Retrieval Series, Vol. 32, Springer, 2010.
Theodora Tsikrika and Jana Kludas. Overview of the wikipediaMM task at ImageCLEF 2009. In Multilingual Information Access Evaluation Vol. II Multimedia Experiments -- Proceedings of the 10th Workshop of the Cross-Language Evaluation Forum (CLEF 2009), Revised Selected Papers, Lecture Notes in Computer Science, Springer, 2010.
Theodora Tsikrika and Jana Kludas. Overview of the wikipediaMM task at ImageCLEF 2008. In Evaluating Systems for Multilingual and Multimodal Information Access -- Proceedings of the 9th Workshop of the Cross-Language Evaluation Forum (CLEF 2008), Revised Selected Papers, Lecture Notes in Computer Science, volume 5706, pp. 539-550, Springer, 2009.
