Welcome to the 8th edition of the Caption Task!
Motivation
Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines.
Consequently, there is a considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. The more image characteristics are known, the more structured the radiology scans are and, hence, the more efficiently radiologists can interpret them. We work on the basis of a large-scale collection of figures from open-access biomedical journal articles (PubMed Central). All images in the training data are accompanied by UMLS concepts extracted from the original image caption.
Lessons learned:
- In the first and second editions of this task, held at ImageCLEF 2017 and ImageCLEF 2018, participants noted a broad variety of content and contexts among the training images. In 2019, the training data was reduced solely to radiology images, and ImageCLEF 2020 added imaging modality information for pre-processing purposes and multi-modal approaches.
- The focus in ImageCLEF 2021 lay in using real radiology images annotated by medical doctors. This step aimed at increasing the medical context relevance of the UMLS concepts, but more images of such high quality are difficult to acquire.
- As uncertainty regarding additional data sources was noted, we will clearly separate systems that use exclusively the official training data from those that incorporate additional sources of evidence.
- For ImageCLEF 2022, an extended version of the ImageCLEF 2020 dataset was used. For the caption prediction subtask, a number of additional evaluation metrics were introduced with the goal of replacing the primary evaluation metric in future iterations of the task.
- For ImageCLEF 2023, several issues with the dataset (large number of concepts, lemmatization errors, duplicate captions) were tackled and based on experiments in the previous year, BERTScore was used as the primary evaluation metric for the caption prediction subtask.
News
- 24.10.2023: website goes live
- 15.12.2023: registration opens
- 16.02.2024: development dataset released
- 01.04.2024: test dataset released
- 02.05.2024: run submission deadline extended to May 13
- 13.05.2024: run submission phase ended
- 17.05.2024: results published
- 31.05.2024: submission of participant papers [CEUR-WS]
- 21.06.2024: notification of acceptance
Task Description
For captioning, participants will be requested to develop solutions for automatically identifying the individual components from which captions are composed in Radiology Objects in COntext version 2 (ROCOv2) [2] images.
ImageCLEFmedical Caption 2024 consists of two subtasks:
- Concept Detection Task
- Caption Prediction Task
Concept Detection Task
The first step to automatic image captioning and scene understanding is identifying the presence and location of relevant concepts in a large corpus of medical images. Based on the visual image content, this subtask provides the building blocks for the scene understanding step by identifying the individual components from which captions are composed. The concepts can be further applied for context-based image and information retrieval purposes.
Evaluation is conducted in terms of set coverage metrics such as precision, recall, and combinations thereof.
Caption Prediction Task
On the basis of the concept vocabulary detected in the first subtask as well as the visual information of their interaction in the image, participating systems are tasked with composing coherent captions for the entirety of an image. In this step, rather than the mere coverage of visual concepts, detecting the interplay of visible elements is crucial for strong performance.
As an optional, experimental explainability extension, participants will be asked to provide explanations (e.g., heatmaps, Shapley values) for a small subset of images which will be manually evaluated.
This year, we will use BERTScore as the primary evaluation metric and ROUGE as the secondary evaluation metric for the caption prediction subtask. Other metrics such as MedBERTScore, MedBLEURT, and BLEU will also be published.
Data
The data for the caption task contains curated images from the medical literature, including their captions and associated UMLS concepts, which are manually curated as metadata. A more diverse dataset will be made available to foster more complex approaches.
For the development dataset, Radiology Objects in COntext Version 2 (ROCOv2) [2], an updated and extended version of the Radiology Objects in COntext (ROCO) dataset [1], is used for both subtasks. As in previous editions, the dataset originates from biomedical articles of the PMC OpenAccess subset, with the test set comprising a previously unseen set of images.
- Training Set: 70,108 radiology images
- Validation Set: 9,972 radiology images
- Test Set: 17,237 radiology images
Concept Detection Task
The concepts were generated using a reduced subset of the UMLS 2022 AB release. To improve the feasibility of recognizing concepts from the images, concepts were filtered based on their semantic type. Concepts with low frequency were also removed, based on suggestions from previous years.
Caption Prediction Task
For this task, each caption is pre-processed in the following way:
- removal of links from the captions (a minimal sketch is shown below)
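For illustration, a minimal sketch of this link-removal step is given below; the regular expression is an assumption, and the official pre-processing script defines the exact behaviour.

```python
import re

def remove_links(caption: str) -> str:
    """Remove URLs from a caption (illustrative only, not the official script)."""
    # Matches http(s)/ftp URLs as well as bare www. links (an assumed pattern)
    url_pattern = re.compile(r"(?:https?://|ftp://|www\.)\S+", flags=re.IGNORECASE)
    return re.sub(url_pattern, "", caption).strip()

# Example
print(remove_links("Axial CT image, case available at www.example.org/case123"))
# -> "Axial CT image, case available at"
```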
Evaluation methodology
For assessing performance, classic metrics are used, ranging from set-based measures such as the F1 score (concept detection) to text-similarity measures such as BERTScore and ROUGE (caption prediction).
The source code of the evaluation script is available at Sciebo.
Concept Detection
Evaluation is conducted in terms of F1 scores between system predicted and ground truth concepts, using the following methodology and parameters:
- The default implementation of the Python scikit-learn (v0.17.1-2) F1 scoring method is used. It is documented here.
- A Python (3.x) script loads the candidate run file as well as the ground truth (GT) file and processes each candidate-GT concept set pair.
- For each candidate-GT pair, the y_pred and y_true arrays are generated. They are binary arrays indicating, for each concept in the union of the candidate and GT sets, whether it is present (1) or not (0).
- The F1 score is then calculated. The default 'binary' averaging method is used.
- All F1 scores are summed and averaged over the number of elements in the test set, giving the final score.
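As a rough illustration of this procedure, the sketch below computes a per-image F1 with scikit-learn and averages it over the test set. Reading the run and GT files is omitted, and binarizing over the union of candidate and GT concepts is an assumption derived from the description above rather than the official script.

```python
import numpy as np
from sklearn.metrics import f1_score

def image_f1(candidate: set[str], ground_truth: set[str]) -> float:
    """F1 score for a single image, following the per-image binarization described above."""
    # Assumption: the binary vectors are defined over the union of both concept sets
    vocab = sorted(candidate | ground_truth)
    y_true = np.array([1 if c in ground_truth else 0 for c in vocab])
    y_pred = np.array([1 if c in candidate else 0 for c in vocab])
    return f1_score(y_true, y_pred, average="binary")  # 'binary' is the default averaging

def mean_f1(candidates: dict[str, set[str]], ground_truths: dict[str, set[str]]) -> float:
    """Final score: per-image F1 scores averaged over all test-set images."""
    scores = [image_f1(candidates.get(image_id, set()), gt)
              for image_id, gt in ground_truths.items()]
    return sum(scores) / len(scores)
```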
The ground truth for the test set was generated based on the same reduced subset of the UMLS 2022 AB release which was used for the training data (see above for more details).
Caption Prediction
This year, BERTScore is used as the primary metric, and ROUGE is used as a secondary metric. Other metrics will be reported after the challenge concludes.
For all metrics, each caption is pre-processed in the same way:
- The caption is converted to lower-case
- Numbers are replaced with the token 'number'
- Punctuation is removed
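A minimal sketch of these three steps could look as follows; the exact regular expressions and their ordering in the official evaluation script may differ.

```python
import re
import string

def preprocess_caption(caption: str) -> str:
    """Lower-case, replace digit sequences with the token 'number', and strip punctuation."""
    caption = caption.lower()
    caption = re.sub(r"\d+", "number", caption)  # any digit sequence becomes 'number'
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    return caption

# Example
print(preprocess_caption("CT scan of a 67-year-old patient."))
# -> "ct scan of a numberyearold patient"
```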
BERTScore is calculated using the following methodology and parameters:
The native Python implementation of BERTScore is used, which can be found in the GitHub repository here. This scoring method is based on the paper "BERTScore: Evaluating Text Generation with BERT" and aims to measure the quality of generated text by comparing it to a reference.
To calculate BERTScore, we use the microsoft/deberta-xlarge-mnli model, which can be found on the Hugging Face Model Hub. The model is pretrained on a large corpus of text and fine-tuned for natural language inference tasks. It can be used to compute contextualized word embeddings, which are essential for BERTScore calculation.
To compute the final BERTScore, we first calculate the individual score (F1) for each sentence. Then all BERTScores are summed and averaged over the number of captions, yielding the final score.
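A sketch of this computation with the bert-score package and the model named above is given below; any parameters beyond the model choice (e.g., baseline rescaling) are assumptions rather than the official configuration.

```python
from bert_score import score  # pip install bert-score

def mean_bertscore_f1(candidates: list[str], references: list[str]) -> float:
    """Average BERTScore F1 over all candidate/reference caption pairs."""
    # Captions are assumed to be pre-processed as described above
    P, R, F1 = score(
        candidates,
        references,
        model_type="microsoft/deberta-xlarge-mnli",
    )
    # Sum of per-caption F1 scores divided by the number of captions
    return F1.mean().item()
```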
The ROUGE scores are calculated using the following methodology and parameters:
- The native Python implementation of the ROUGE scoring method is used. It is designed to replicate results from the original Perl package introduced in the article describing the ROUGE evaluation method.
- Specifically, we calculate the ROUGE-1 (F-measure) score, which measures the number of matching unigrams between the model-generated text and a reference.
- A Python (3.7) script loads the candidate run file, as well as the ground truth (GT) file, and processes each candidate-GT caption pair
- The ROUGE score is then calculated. Note that the caption is always considered as a single sentence, even if it actually contains several sentences.
- All ROUGE scores are summed and averaged over the number of captions, giving the final score.
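A corresponding sketch is shown below, assuming the rouge-score package (a Python reimplementation of the original Perl ROUGE); the stemming setting is also an assumption.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def mean_rouge1_f(candidates: list[str], references: list[str]) -> float:
    """Average ROUGE-1 F-measure over all candidate/reference caption pairs."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)  # stemming: assumption
    scores = [
        # Each caption is treated as a single sentence; score(target, prediction)
        scorer.score(ref, cand)["rouge1"].fmeasure
        for cand, ref in zip(candidates, references)
    ]
    return sum(scores) / len(scores)
```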
Participant registration
Please refer to the general ImageCLEF registration instructions
Preliminary Schedule
- 30.11.2023: Registration opens
- 16.02.2024: Release of the training and validation sets
- 01.04.2024: Release of the test sets
- 22.04.2024: Registration closes
- 13.05.2024 (extended from 06.05.2024): Run submission deadline
- 17.05.2024 (changed from 13.05.2024): Release of the processed results by the task organizers
- 31.05.2024: Submission of participant papers [CEUR-WS]
- 21.06.2024: Notification of acceptance
- 08.07.2024: Camera ready copy of participant papers and extended lab overviews [CEUR-WS]
- 09-12.09.2024: CLEF 2024, Grenoble, France
Submission Instructions
Please find the submission instructions on the challenge platform (tab "Files").
Please note that starting in 2024, participants will be required to upload their code (not the models) in addition to the predicted test set labels.
Results
The tables below contain only the best runs of each team; for a complete list of all runs, please see the CSV files in this Sciebo folder.
Concept Detection Task
For the concept detection task, the ranking is based on the F1-score as described in the Evaluation methodology section above. Additionally, a secondary F1-score ("F1-score Manual") was calculated using only a subset of manually validated concepts: anatomy (X-ray images) and image modality (all images).
| Team | Owner | ID | F1-score | F1-score Manual |
|---|---|---|---|---|
| DBS-HHU | hekau101 | 601 | 0.637466 | 0.953425 |
| auebnlpgroup | auebnlpgroup | 644 | 0.631908 | 0.939278 |
| DS@BioMed | triettm | 653 | 0.619977 | 0.931173 |
| SSNMLRGKSR | ssnmlrgksr | 425 | 0.600061 | 0.905603 |
| UACH-VisionLab | graciela | 235 | 0.598764 | 0.936313 |
| MICLabNM | dscarmo | 681 | 0.579545 | 0.883523 |
| Kaprov | pradeepkmaran | 558 | 0.460925 | 0.730118 |
| VIT_Conceptz | lekshmiscopevit | 233 | 0.181204 | 0.264671 |
| CS_Morgan | csmorgan | 530 | 0.107645 | 0.210548 |
Caption Prediction Task
For the caption prediction task, the ranking is based on the BERTScore.
| Team | ID | BERTScore | ROUGE | BLEU-1 | BLEURT | METEOR | CIDEr | CLIPScore | RefCLIPScore | ClinicalBLEURT | MedBERTScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| pclmed | 634 | 0.629913 | 0.272626 | 0.268994 | 0.337626 | 0.113264 | 0.268133 | 0.823614 | 0.817610 | 0.466557 | 0.632318 |
| CS_Morgan | 429 | 0.628059 | 0.250801 | 0.209298 | 0.317385 | 0.092682 | 0.245029 | 0.821262 | 0.815534 | 0.455942 | 0.632664 |
| DarkCow | 220 | 0.626720 | 0.245228 | 0.195044 | 0.306005 | 0.088897 | 0.224250 | 0.818440 | 0.811700 | 0.456199 | 0.629189 |
| auebnlpgroup | 630 | 0.621112 | 0.204883 | 0.111034 | 0.289907 | 0.068022 | 0.176923 | 0.804067 | 0.798684 | 0.486560 | 0.626134 |
| 2Q2T | 643 | 0.617814 | 0.247755 | 0.221252 | 0.313942 | 0.098590 | 0.220037 | 0.827074 | 0.813756 | 0.475908 | 0.622447 |
| MICLab | 678 | 0.612850 | 0.213525 | 0.185269 | 0.306743 | 0.077181 | 0.158239 | 0.815925 | 0.804924 | 0.445257 | 0.617195 |
| DLNU_CCSE | 674 | 0.606578 | 0.217857 | 0.151179 | 0.283133 | 0.070419 | 0.168765 | 0.796707 | 0.790424 | 0.475625 | 0.612954 |
| Kaprov | 559 | 0.596362 | 0.190497 | 0.169726 | 0.295109 | 0.060896 | 0.107017 | 0.792183 | 0.787201 | 0.439971 | 0.608924 |
| DS@BioMed | 571 | 0.579438 | 0.103095 | 0.012144 | 0.220211 | 0.035335 | 0.071529 | 0.775566 | 0.774823 | 0.529529 | 0.580388 |
| DBS-HHU | 637 | 0.576891 | 0.153103 | 0.149275 | 0.270965 | 0.055929 | 0.064361 | 0.784199 | 0.774985 | 0.476634 | 0.582744 |
| KDE-medical-caption | 557 | 0.567329 | 0.132496 | 0.106025 | 0.256576 | 0.038628 | 0.038404 | 0.765059 | 0.760958 | 0.502234 | 0.569659 |
CEUR Working Notes
For detailed instructions, please refer to this PDF file. A summary of the most important points:
- All participating teams with at least one graded submission, regardless of the score, should submit a CEUR working notes paper.
- Teams who participated in both tasks should generally submit only one report
- Submission of reports is done through EasyChair – please make absolutely sure that the author (names and order), title, and affiliation information you provide in EasyChair match the submitted PDF exactly!
- Strict deadline for Working Notes Papers: 31 May 2024 (23:59 CEST)
- Strict deadline for CEUR-WS Camera Ready Working Notes Papers: 08 July 2024 (23:59 CEST)
- Make sure to include the signed Copyright Form when uploading the Camera Ready Working Note Papers
- Templates are available here
- Working Notes Papers should cite the ImageCLEF 2024 overview paper, the ImageCLEFmedical task overview paper, and the ROCOv2 dataset paper; citation information is available in the Citations section below.
Citations
When referring to the development dataset used for ImageCLEFmedical 2024 Caption, please cite the following arXiv preprint:
- Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S. Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, Henning Müller, Peter A. Horn, Felix Nensa and Christoph M. Friedrich (2024). ROCOv2: Radiology objects in COntext version 2, an updated multimodal image dataset. In arXiv [eess.IV]. http://arxiv.org/abs/2405.10004
- BibTex:
@article{rocov2,
title={{ROCOv2}: {Radiology Objects in COntext} Version 2, an Updated Multimodal Image Dataset},
author={Johannes Rückert and Louise Bloch and Raphael Brüngel and Ahmad Idrissi-Yaghir and Henning Schäfer and Cynthia S. Schmidt and Sven Koitka and Obioma Pelka and Asma Ben Abacha and Alba G. Seco de Herrera and Henning Müller and Peter A. Horn and Felix Nensa and Christoph M. Friedrich},
year={2024},
journal={Scientific Data},
doi={10.1038/s41597-024-03496-6},
url={https://arxiv.org/abs/2405.10004v1}
}
Citation information for the two overview papers will follow soon.
When referring to ImageCLEF 2024, please cite the following publication:
- Bogdan Ionescu, Henning Müller, Ana-Maria Drăgulinescu, Johannes
Rückert, Asma Ben Abacha, Alba G. Seco de Herrera, Louise Bloch,
Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia
Sabrina Schmidt, Tabea M. G. Pakull, Hendrik Damm, Benjamin Bracke,
Christoph M. Friedrich, Alexandra-Georgiana Andrei, Yuri Prokopchuk,
Dzmitry Karpenka, Ahmedkhan Radzhabov, Vassili Kovalev, Cécile
Macaire, Didier Schwab, Benjamin Lecouteux, Emmanuelle
Esperança-Rodier, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen,
Fei Xia, Steven A. Hicks, Michael A. Riegler, Vajira Thambawita,
Andrea Storås, Pål Halvorsen, Maximilian Heinrich, Johannes Kiesel,
Martin Potthast, Benno Stein, Overview of the ImageCLEF 2024:
Multimedia Retrieval in Medical Applications, in Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Proceedings of the
15th International Conference of the CLEF Association (CLEF 2024),
Springer Lecture Notes in Computer Science LNCS, University of
Grenoble Alpes, France, 9-12 September, 2024.
- BibTex: @inproceedings{ImageCLEF2024,
author = {Bogdan Ionescu and Henning M\"uller and Ana{-}Maria
Dr\u{a}gulinescu and Johannes R\"uckert and Asma {Ben Abacha} and Alba
{Garc\'{\i}a Seco de Herrera} and Louise Bloch and Raphael Br\"ungel
and Ahmad Idrissi{-}Yaghir and Henning Sch\"afer and Cynthia Sabrina
Schmidt and Tabea M. G. Pakull and Hendrik Damm and Benjamin Bracke and
Christoph M. Friedrich and Alexandra{-}Georgiana Andrei and Yuri
Prokopchuk and Dzmitry Karpenka and Ahmedkhan Radzhabov and Vassili
Kovalev and C\'ecile Macaire and Didier Schwab and Benjamin Lecouteux
and Emmanuelle Esperan\c{c}a{-}Rodier and Wen{-}wai Yim and Yujuan Fu
and Zhaoyi Sun and Meliha Yetisgen and Fei Xia and Steven A. Hicks and
Michael A. Riegler and Vajira Thambawita and Andrea Stor\r{a}s and
P\r{a}l Halvorsen and Maximilian Heinrich and Johannes Kiesel and
Martin Potthast and Benno Stein},
title = {{Overview of ImageCLEF 2024}: Multimedia Retrieval in Medical
Applications},
booktitle = {Experimental IR Meets Multilinguality, Multimodality, and
Interaction},
series = {Proceedings of the 15th International Conference of the CLEF
Association (CLEF 2024)},
year = {2024},
publisher = {Springer Lecture Notes in Computer Science LNCS},
pages = {},
month = {September 9-12},
address = {Grenoble, France}
}
When referring to ImageCLEFmedical 2024 Caption general goals, general results, etc. please cite the following publication:
- Johannes Rückert, Asma Ben Abacha, Alba G. Seco de Herrera, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Benjamin Bracke, Hendrik Damm, Tabea M. G. Pakull, Cynthia Sabrina Schmidt, Henning Müller and Christoph M. Friedrich. Overview of ImageCLEFmedical 2024 – Caption Prediction and Concept Detection. CEUR Workshop Proceedings (CEUR-WS.org), Grenoble, France, September 9-12, 2024.
- BibTex:
@inproceedings{ImageCLEFmedicalCaptionOverview2024,
author = {R\"uckert, Johannes and Ben Abacha, Asma and G. Seco de Herrera, Alba and Bloch, Louise and Br\"ungel, Raphael and Idrissi-Yaghir, Ahmad and Sch\"afer, Henning and Bracke, Benjamin and Damm, Hendrik and Pakull, Tabea M. G. and Schmidt, Cynthia Sabrina and M\"uller, Henning and Friedrich, Christoph M.},
title = {Overview of {ImageCLEFmedical} 2024 -- {Caption Prediction and Concept Detection}},
booktitle = {CLEF2024 Working Notes},
series = {{CEUR} Workshop Proceedings},
year = {2024},
volume = {},
publisher = {CEUR-WS.org},
pages = {},
month = {September 9-12},
address = {Grenoble, France}
}
Contact
Organizers:
- Johannes Rückert <johannes.rueckert(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
- Asma Ben Abacha <abenabacha(at)microsoft.com>, Microsoft, USA
- Alba García Seco de Herrera <alba.garcia(at)essex.ac.uk>, University of Essex, UK
- Christoph M. Friedrich <christoph.friedrich(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
- Henning Müller <henning.mueller(at)hevs.ch>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
- Louise Bloch <louise.bloch(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
- Raphael Brüngel <raphael.bruengel(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
- Ahmad Idrissi-Yaghir <ahmad.idrissi-yaghir(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
- Henning Schäfer <henning.schaefer(at)uk-essen.de>, Institute for Transfusion Medicine, University Hospital Essen, Germany
- Cynthia S. Schmidt, Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen
- Tabea M. G. Pakull, Institute for Transfusion Medicine, University Hospital Essen, Germany
- Hendrik Damm, University of Applied Sciences and Arts Dortmund, Germany
- Benjamin Bracke, University of Applied Sciences and Arts Dortmund, Germany
Acknowledgments
[1] Pelka, O., Koitka, S., Rückert, J., Nensa, F., & Friedrich, C. M. (2018). Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (pp. 180–189). Springer International Publishing.
[2] Rückert, J., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer, H., Schmidt, C. S., Koitka, S., Pelka, O., Abacha, A. B., de Herrera, A. G. S., Müller, H., Horn, P. A., Nensa, F., & Friedrich, C. M. (2024). ROCOv2: Radiology objects in COntext version 2, an updated multimodal image dataset. https://doi.org/10.48550/ARXIV.2405.10004