
MultimodalReasoning

Motivation

Vision-Language Models (VLMs) show impressive capabilities in tasks that require integrating vision and language, such as image captioning, simple visual question answering, and visual dialogue. However, they still struggle with deep logical reasoning and inference, and may have difficulty answering questions that require reasoning through complex dependencies or hypothetical scenarios.
The goal of this task is to assess the reasoning capabilities of modern VLMs on complex inputs presented in different languages and across various subjects.

News

Training data is already publicly available as described in the Data section.

Test data will be released at a later date in accordance with the Schedule section.

Task Description

MultimodalReason is a new task focused on Multilingual Visual Question Answering (VQA). The task is formulated as follows:
Given an image containing a question and 3-5 possible answers, participants must identify the single correct answer.
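
To make the formulation concrete, here is a minimal Python sketch: one instance is an image of a question with 3-5 candidate answers, and a system must return the letter of the single correct answer. The Instance class and random_baseline function are illustrative names only, not part of the official data format; a real system would replace the random choice with a VLM prediction.

    import random
    from dataclasses import dataclass

    @dataclass
    class Instance:
        """One task instance: an image of a question with 3-5 candidate answers.

        Field names are illustrative; consult the Exams-V release for the
        actual schema.
        """
        image_path: str   # image containing the question and its answer options
        num_choices: int  # between 3 and 5
        language: str     # language of the question, e.g. "bg" or "ar"

    def random_baseline(instance: Instance) -> str:
        """Return a uniformly random answer letter among the available choices.

        A real system would instead run a VLM over instance.image_path and
        return the letter of its predicted answer.
        """
        letters = "ABCDE"[: instance.num_choices]
        return random.choice(letters)

    # A 4-choice question gives a 25% chance of a correct random guess.
    print(random_baseline(Instance("question_001.png", 4, "bg")))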

Dataset

Data

  • The training dataset for the task is available here: Exams-V (a loading sketch follows this list).
  • Important: this release includes only the training and dev/validation data, split into 16,724 training and 4,208 dev/validation instances.
  • Test data will be made available later.
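
If the Exams-V data is consumed through the Hugging Face datasets library, the training and dev/validation splits can be loaded roughly as sketched below. The Hub identifier Rocktim/EXAMS-V and the split names are assumptions; check the Exams-V page linked above for the exact values.

    from datasets import load_dataset

    # Assumed Hub identifier and split names; verify them on the Exams-V page.
    dataset = load_dataset("Rocktim/EXAMS-V")

    train = dataset["train"]       # expected: 16,724 instances
    dev = dataset["validation"]    # expected: 4,208 instances

    print(len(train), len(dev))
    print(train[0].keys())         # inspect the fields of a single instance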


Evaluation methodology

The official evaluation measure for the task will be accuracy.
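
Accuracy is simply the fraction of test questions whose predicted answer matches the gold answer. Below is a minimal sketch, assuming gold labels and predictions are answer letters keyed by question id; this dictionary layout is an assumption for illustration, not the official scoring format.

    def accuracy(gold: dict[str, str], predicted: dict[str, str]) -> float:
        """Fraction of questions answered correctly.

        gold and predicted map each question id to an answer letter ("A"-"E");
        questions missing from predicted count as wrong.
        """
        if not gold:
            raise ValueError("empty gold standard")
        correct = sum(1 for qid, answer in gold.items()
                      if predicted.get(qid) == answer)
        return correct / len(gold)

    # Example: 2 of 3 predictions match the gold answers, so accuracy is ~0.667.
    print(accuracy({"q1": "A", "q2": "C", "q3": "B"},
                   {"q1": "A", "q2": "C", "q3": "D"}))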

Participant registration

Please refer to the general ImageCLEF registration instructions.

Preliminary Schedule

  • 20.12.2024: Registration opens for all ImageCLEF tasks
  • 25.04.2025: Registration closes for all ImageCLEF tasks
  • 12.04.2025: Test data release
  • 10.05.2025: Deadline for submitting the participants' runs
  • 17.05.2025: Release of the processed results by the task organizers
  • 30.05.2025: Deadline for submission of working notes papers by the participants
  • 27.06.2025: Notification of acceptance of the working notes papers
  • 07.07.2025: Camera-ready working notes papers
  • 09-12.09.2025: CLEF 2025, Madrid, Spain

Submission Instructions

Follow the Participant Registration section to register on the evaluation platform.
The test set comprises questions in 13 different languages, some of which are not present in the Exams-V data. Therefore, 14 leaderboards will be available: one for each language and one for multilingual submissions. This allows teams that want to participate in all languages to do so with a single submission.
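
The exact submission file format is defined on the evaluation platform. As an illustration only, the sketch below shows one plausible way to collect per-language predictions into a single file so that one run can cover both the per-language and the multilingual leaderboards; the file layout and field names are hypothetical.

    import csv
    from collections import defaultdict

    # Hypothetical predictions: question id -> (language code, answer letter).
    predictions = {
        "q-0001": ("bg", "B"),
        "q-0002": ("ar", "D"),
        "q-0003": ("bg", "A"),
    }

    # Group by language to sanity-check per-language coverage before export.
    by_language: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for qid, (lang, answer) in predictions.items():
        by_language[lang].append((qid, answer))
    print({lang: len(items) for lang, items in by_language.items()})

    # Write all predictions into one file; the required format may differ.
    with open("submission.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question_id", "language", "answer"])
        for qid, (lang, answer) in sorted(predictions.items()):
            writer.writerow([qid, lang, answer])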

Results

CEUR Working Notes

For detailed instructions, please refer to this PDF file. A summary of the most important points:

  • All participating teams with at least one graded submission, regardless of the score, should submit a CEUR working notes paper.
  • Teams that participated in more than one task should generally submit only one report.

Citations

Citation information for overview papers will be posted in this section later.

When referring to the training data please use the following citation:

  • Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. 2024. EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7768–7791, Bangkok, Thailand. Association for Computational Linguistics.
  • BibTex:
    @inproceedings{das-etal-2024-exams,
        title = "{EXAMS}-{V}: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models",
        author = "Das, Rocktim and Hristov, Simeon and Li, Haonan and Dimitrov, Dimitar and Koychev, Ivan and Nakov, Preslav",
        editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-long.420",
        doi = "10.18653/v1/2024.acl-long.420",
        pages = "7768--7791"
    }

Contact

Organizers:

  • Dimitar Dimitrov <mitko.bg.ss@gmail.com; ilijanovd@fmi.uni-sofia.bg>, Sofia University "St. Kliment Ohridski", Bulgaria
  • Rocktim Jyoti Das <Rocktim.JyotiDas@mbzuai.ac.ae>, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
  • Zhuohan Xie <Zhuohan.xie@mbzuai.ac.ae>, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
  • Hee Ming Shan, Singapore University of Technology and Design
  • Preslav Nakov, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
  • Ivan Koychev, Sofia University "St. Kliment Ohridski", Bulgaria

Contributors:

  • Nikolay Paev, Sofia University "St. Kliment Ohridski", Bulgaria
  • Georgi Georgiev, Sofia University "St. Kliment Ohridski", Bulgaria
  • Ali Mekky, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
  • Rania Hossam, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
  • Nurdaulet Mukhituly, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
  • Akhmed Sakip, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
  • Omar El Herraoui, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

Acknowledgments
