Motivation
Vision-Language Models (VLMs) show impressive capabilities in tasks that require the integration of vision and language, such as image captioning, simple visual question answering, and visual dialogue. However, they still struggle with deep logical reasoning and inference, and they often have difficulty answering questions that require reasoning over complex dependencies or hypothetical scenarios.
The goal of the task is to assess the reasoning capabilities of modern VLMs on complex inputs, presented in different languages and spanning a variety of subjects.
News
The test submission deadline has been extended to 14.05.2025, 03:00 AM GMT+3.
The test data has been released: https://huggingface.co/datasets/MBZUAI/EXAMS-V. This repository also contains the training and development data.
GitHub: https://github.com/mbzuai-nlp/ImageCLEF-2025-MultimodalReasoning, where you will find information about the task, the submission format, the evaluation, and the code for our baselines. We also provide the captions/descriptions that were used for the baselines.
Task Description
MultimodalReason is a new task focused on Multilingual Visual Question Answering (VQA). The task is formulated as follows:
Given an image of a question with 3–5 possible answers, participants must identify the single correct answer.

Data
- The training dataset for the task is available here: EXAMS-V (a minimal loading sketch is shown after this list).
- Important: this release includes only the training and dev/validation data, split into 16,724 training and 4,208 dev/validation instances.
- The test data will be made available later (see the News section above for the release announcement).
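For convenience, here is a minimal sketch of loading the data with the Hugging Face datasets library. The split and field names are assumptions, so please check the dataset card at https://huggingface.co/datasets/MBZUAI/EXAMS-V for the exact schema.

```python
# Minimal loading sketch (assumes the `datasets` library is installed).
# The split name "train" and the record fields are assumptions -- consult the
# dataset card for the authoritative schema. If the dataset defines multiple
# configurations (e.g. per language), pass the subset name as a second argument.
from datasets import load_dataset

exams_v = load_dataset("MBZUAI/EXAMS-V")   # downloads the available splits
print(exams_v)                             # shows the splits and their sizes

sample = exams_v["train"][0]               # assumed split name
print(sample.keys())                       # inspect the available fields
```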

Additionally, you can go to our GitHub repository, where you will find information about the task, submission format, evaluation, and code for our baselines. We also provide the captions/descriptions that were used for the baselines.
GitHub: https://github.com/mbzuai-nlp/ImageCLEF-2025-MultimodalReasoning
Evaluation methodology
The test set comprises questions in 13 different languages, some of which are not present in the EXAMS-V data. Therefore, 14 leaderboards will be available: one for each language and one for multilingual submissions. The latter allows teams that want to participate in all languages to do so with a single submission.
The official evaluation measure for the task will be accuracy.
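To make the metric concrete, the sketch below computes per-language accuracy as well as the multilingual score from simple question-id to answer-label mappings. These data structures are hypothetical; the official evaluation code is in the GitHub repository linked above.

```python
# Minimal sketch of the accuracy computation. The inputs (dicts mapping
# question id -> answer label, plus a question id -> language lookup) are
# hypothetical -- the official scorer is in the task's GitHub repository.
from collections import defaultdict

def accuracy(gold: dict, predictions: dict) -> float:
    """Fraction of questions whose predicted label matches the gold label."""
    correct = sum(1 for qid, ans in gold.items() if predictions.get(qid) == ans)
    return correct / len(gold) if gold else 0.0

def per_language_accuracy(gold: dict, predictions: dict, language_of: dict) -> dict:
    """Accuracy per language; the multilingual score uses all questions at once."""
    by_lang = defaultdict(dict)
    for qid, ans in gold.items():
        by_lang[language_of[qid]][qid] = ans
    scores = {lang: accuracy(g, predictions) for lang, g in by_lang.items()}
    scores["multilingual"] = accuracy(gold, predictions)
    return scores
```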
Participant registration
Please refer to the general ImageCLEF registration instructions
Submission Instructions
- The submission must include a run file named exactly run.json.
- This file must be zipped; the zip file can have any name (a minimal packaging sketch is shown after this list).
- Please refer to the Submission format section on our GitHub and make sure you follow all the rules before your first submission.
- Each team may submit up to 20 runs per day, with a maximum of 200 submissions over the entire phase. Submissions that produce errors also count towards this limit, so please be careful.
- You will not receive any feedback until the end of the evaluation phase.
- We will evaluate only the last successful submission of each team.
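As a rough illustration of the packaging step referenced above, the sketch below writes a run.json file and zips it. The content shown (a question-id to answer-label mapping) is only a placeholder; follow the Submission format section on GitHub for the actual schema.

```python
# Minimal packaging sketch: write run.json and zip it.
# The structure of run.json here is a placeholder, not the official format.
import json
import zipfile

predictions = {"question_001": "A", "question_002": "C"}   # hypothetical answers

with open("run.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)

# The file inside the archive must be named exactly run.json;
# the zip itself can have any name.
with zipfile.ZipFile("my_team_submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("run.json")
```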
Preliminary Schedule
- 20.12.2024: Registration opens for all ImageCLEF tasks
- 25.04.2025: Registration closes for all ImageCLEF tasks
- 12.04.2025: Test data release
- 14.05.2025, 03:00 AM GMT+3 (extended from 10.05.2025): Deadline for submitting the participants' runs
- 17.05.2025: Release of the processed results by the task organizers
- 30.05.2025: Deadline for submission of working notes papers by the participants
- 27.06.2025: Notification of acceptance of the working notes papers
- 07.07.2025: Camera ready working notes papers
- 09-12.09.2025: CLEF 2025, Madrid, Spain
Results
Multilingual

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.8140 |
| 2 | ymgclef | 0.5994 |
| 3 | lekshmiscopevit | 0.5770 |
| 4 | bingezzzleep | 0.5619 |
| 5 | plutohbj | 0.5226 |
| 6 | deng113abc | 0.5195 |
| 7 | mhl2001 | 0.4418 |
| 8 | yaozihang | 0.4376 |
| 9 | baseline* | 0.2701 |
| 10 | elenat | 0.2188 |

English

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | ContextDrift | 0.8965 |
| 2 | MSA | 0.8652 |
| 3 | ayeshaamjad | 0.8125 |
| 4 | ContextDrift | 0.8086 |
| 5 | ymgclef | 0.5938 |
| 6 | deng113abc | 0.5371 |
| 7 | bingezzzleep | 0.5312 |
| 8 | plutohbj | 0.4922 |
| 9 | mhl2001 | 0.4629 |
| 10 | yaozihang | 0.4570 |
| 11 | elenat | 0.2520 |
| 12 | baseline* | 0.2480 |

Bulgarian

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | ContextDrift | 0.9050 |
| 2 | ymgclef | 0.7750 |
| 3 | bingezzzleep | 0.7500 |
| 3 | MSA | 0.7500 |
| 4 | plutohbj | 0.7300 |
| 5 | baseline* | 0.2450 |
| 6 | elenat | 0.2350 |

Chinese

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.8305 |
| 2 | ayeshaamjad | 0.6560 |
| 3 | plutohbj | 0.5921 |
| 4 | bingezzzleep | 0.5799 |
| 5 | mhl2001 | 0.5553 |
| 6 | ymgclef | 0.5283 |
| 7 | yaozihang | 0.4791 |
| 8 | baseline* | 0.2678 |

German

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.8915 |
| 2 | ymgclef | 0.7403 |
| 3 | bingezzzleep | 0.6860 |
| 4 | plutohbj | 0.6783 |
| 5 | yaozihang | 0.4961 |
| 6 | mhl2001 | 0.4922 |
| 7 | baseline* | 0.3101 |

Arabic

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.6757 |
| 2 | ayeshaamjad | 0.4775 |
| 3 | mhl2001 | 0.4730 |
| 4 | ymgclef | 0.4324 |
| 5 | plutohbj | 0.3514 |
| 6 | bingezzzleep | 0.3243 |
| 7 | baseline* | 0.2703 |

Italian

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.9212 |
| 2 | bingezzzleep | 0.6059 |
| 2 | plutohbj | 0.6059 |
| 3 | ymgclef | 0.6010 |
| 4 | baseline* | 0.2414 |

Spanish

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.7198 |
| 2 | ymgclef | 0.6696 |
| 3 | bingezzzleep | 0.6608 |
| 4 | plutohbj | 0.5723 |
| 5 | baseline* | 0.3156 |

Urdu

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.8067 |
| 2 | ymgclef | 0.3941 |
| 3 | bingezzzleep | 0.3569 |
| 3 | yaozihang | 0.3569 |
| 4 | baseline* | 0.3011 |

Serbian

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.7143 |
| 2 | bingezzzleep | 0.6059 |
| 3 | ymgclef | 0.5468 |
| 4 | plutohbj | 0.5320 |
| 5 | baseline* | 0.2365 |

Hungarian

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | ymgclef | 0.6518 |
| 2 | bingezzzleep | 0.5425 |
| 3 | plutohbj | 0.4696 |
| 4 | mhl2001 | 0.3563 |
| 5 | baseline* | 0.2348 |

Croatian

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.9507 |
| 2 | bingezzzleep | 0.6207 |
| 3 | ymgclef | 0.5764 |
| 4 | plutohbj | 0.5616 |
| 5 | baseline* | 0.2709 |

Polish

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.8224 |
| 2 | ymgclef | 0.7181 |
| 3 | bingezzzleep | 0.5792 |
| 4 | plutohbj | 0.5251 |
| 5 | baseline* | 0.2934 |

Kazakh

| Rank | Team | Accuracy |
| --- | --- | --- |
| 1 | MSA | 0.8148 |
| 2 | ymgclef | 0.5350 |
| 3 | bingezzzleep | 0.4938 |
| 4 | plutohbj | 0.4444 |
| 5 | baseline* | 0.2738 |
* Baseline system submitted by the organizers
** In the case of equal scores, participants are assigned the same rank and ordered alphabetically in the tables
CEUR Working Notes
For detailed instructions, please refer to this PDF file. A summary of the most important points:
- All participating teams with at least one graded submission, regardless of the score, should submit a CEUR working notes paper.
- Teams that participated in both tasks should generally submit only one report.
Citations
When referring to ImageCLEF 2025 Multimodal Lab, please cite the following publication:
@inproceedings{ImageCLEFmultimodalReasoningOverview2025,
author = {Dimitrov, Dimitar and Hee, Ming Shan and Xie, Zhuohan and
          Jyoti Das, Rocktim and Ahsan, Momina and Ahmad, Sarfraz and
          Paev, Nikolay and Koychev, Ivan and Nakov, Preslav},
title = {Overview of ImageCLEF 2025 -- Multimodal Reasoning},
booktitle = {CLEF 2025 Working Notes},
series = {CEUR Workshop Proceedings},
publisher = {CEUR-WS.org},
address = {Madrid, Spain},
month = {September 9--12},
year = {2025}
}
When referring to ImageCLEF 2025, please cite the following publication:
@inproceedings{OverviewImageCLEF2025,
title = {
Overview of ImageCLEF 2025: Multimedia Retrieval in Medical, Social
Media and Content Recommendation Applications},
author = {
Ionescu, Bogdan and M\"uller, Henning and Stanciu, Dan-Cristian and
Andrei, Alexandra-Georgiana and Radzhabov, Ahmedkhan and Prokopchuk,
Yuri and {\c{S}}tefan, Liviu-Daniel and Constantin, Mihai-Gabriel and
Dogariu, Mihai and Kovalev, Vassili and Damm, Hendrik and R\"uckert,
Johannes and Ben Abacha, Asma and Garc\'ia Seco de Herrera, Alba and
Friedrich, Christoph M. and Bloch, Louise and Br\"ungel, Raphael and
Idrissi-Yaghir, Ahmad and Sch\"afer, Henning and Schmidt, Cynthia
Sabrina and Pakull, Tabea M. G. and Bracke, Benjamin and Pelka, Obioma
and Eryilmaz, Bahadir and Becker, Helmut and Yim, Wen-Wai and Codella,
Noel and Novoa, Roberto Andres and Malvehy, Josep and Dimitrov, Dimitar
and Das, Rocktim Jyoti and Xie, Zhuohan and Hee, Ming Shan and Nakov,
Preslav and Koychev, Ivan and Hicks, Steven A. and Gautam, Sushant and
Riegler, Michael A. and Thambawita, Vajira and Halvorsen, P\r{a}l and
Fabre, Diandra and Macaire, C\'ecile and Lecouteux, Benjamin and
Schwab, Didier and Potthast, Martin and Heinrich, Maximilian and
Kiesel, Johannes and Wolter, Moritz and Stein, Benno
},
year = 2025,
month = {September 9-12},
booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction},
publisher = {Springer Lecture Notes in Computer Science LNCS},
address = {Madrid, Spain},
series = {
Proceedings of the 16th International Conference of the CLEF
Association (CLEF 2025)},
pages = {}
}
When referring to the training data please use the following citation:
- Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. 2024. EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7768–7791, Bangkok, Thailand. Association for Computational Linguistics.
- BibTex:
@inproceedings{das-etal-2024-exams,
title = "{EXAMS}-{V}: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models",
author = "Das, Rocktim and
Hristov, Simeon and
Li, Haonan and
Dimitrov, Dimitar and
Koychev, Ivan and
Nakov, Preslav",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.420",
doi = "10.18653/v1/2024.acl-long.420",
pages = "7768--7791"
}
Contact
Organizers:
- Dimitar Dimitrov <mitko.bg.ss@gmail.com; ilijanovd@fmi.uni-sofia.bg>, Sofia University "St. Kliment Ohridski", Bulgaria
- Rocktim Jyoti Das <Rocktim.JyotiDas@mbzuai.ac.ae>, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Zhuohan Xie <Zhuohan.xie@mbzuai.ac.ae>, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Hee Ming Shan, Singapore University of Technology and Design, Singapore
- Sarfraz Ahmad, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Momina Ahsan, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Preslav Nakov, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Ivan Koychev, Sofia University "St. Kliment Ohridski", Bulgaria
Contributors:
- Nikolay Paev, Sofia University "St. Kliment Ohridski", Bulgaria
- Georgi Georgiev, Sofia University "St. Kliment Ohridski", Bulgaria
- Viktor Kadiyski, Sofia University "St. Kliment Ohridski", Bulgaria
- Daniel Tropolinov, Sofia University "St. Kliment Ohridski", Bulgaria
- Kaloyan Tsvetkov, Sofia University "St. Kliment Ohridski", Bulgaria
- Ali Mekky, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Rania Hossam, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Nurdaulet Mukhituly, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Akhmed Sakip, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
- Omar El Herraoui, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
Acknowledgments