Welcome to the 5th edition of the Caption Task!
Motivation
Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines.
Consequently, there is a considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. The more image characteristics are known, the more structured the radiology scans become and, hence, the more efficiently radiologists can interpret them. We work on the basis of a large-scale collection of figures from open access biomedical journal articles (PubMed Central), as well as radiology images from original medical cases. All images in the training data are accompanied by UMLS concepts extracted from the original image caption.
Lessons learned:
- In the first and second editions of this task, held at ImageCLEF 2017 and ImageCLEF 2018, participants noted a broad variety of content and situations among the training images. In 2019, the training data was reduced solely to radiology images.
- In ImageCLEF 2020, the focus remained on radiology images, with additional imaging modality information for pre-processing purposes and multi-modal approaches.
- The focus in ImageCLEF 2021 lies in using real radiology images annotated by medical doctors. This step aims at increasing the medical context relevance of the UMLS concepts.
- To reduce the scope and number of concepts, several concept extraction tools are analyzed before caption pre-processing methods are applied.
- Concepts that occur infrequently will be removed.
- As uncertainty regarding additional sources was noted, we will clearly separate systems using exclusively the official training data from those that incorporate additional sources of evidence.
News
Task Description
ImageCLEFmed Caption 2021 consists of two subtasks:
Concept Detection Task
The first step to automatic image captioning and scene understanding is identifying the presence and location of relevant concepts in a large corpus of medical images. Based on the visual image content, this subtask provides the building blocks for the scene understanding step by identifying the individual components from which captions are composed. The concepts can be further applied for context-based image and information retrieval purposes.
Evaluation is conducted in terms of set coverage metrics such as precision, recall, and combinations thereof. This task will be run using real clinical radiology images with annotations from medical doctors. In addition, a subset of the extended Radiology Objects in COntext (ROCO) dataset [1], with imaging modality information, is provided for training purposes.
Caption Prediction Task
On the basis of the concept vocabulary detected in the first subtask as well as the visual information of their interaction in the image, participating systems are tasked with composing coherent captions for the entirety of an image. In this step, rather than the mere coverage of visual concepts, detecting the interplay of visible elements is crucial for strong performance.
Evaluation of this second step is based on metrics such as BLEU that have been designed to be robust to variability in style and wording.
Data
The development and test datasets that will be distributed are the same as those of ImageCLEF-VQAMed 2021. This encourages teams to participate in both tasks, as detected concepts can be used as building blocks for the VQA tasks. Conversely, generated questions and answers can be used to evaluate the concept detection models.
Training Set: We will use the VQA-Med 2020 training data: https://www.aicrowd.com/challenges/imageclef-2020-vqa-med-vqa
Validation Set: Consists of 500 radiology images
Test Set: Consists of 444 radiology images
Evaluation methodology
Concept Detection
Evaluation is conducted in terms of F1 scores between system predicted and ground truth concepts, using the following methodology and parameters:
- The default implementation of the Python scikit-learn (v0.17.1-2) F1 scoring method is used. It is documented here.
- A Python (3.x) script loads the candidate run file, as well as the ground truth (GT) file, and processes each pair of candidate and GT concept sets.
- For each candidate-GT pair, the y_pred and y_true arrays are generated. They are binary arrays indicating, for each concept in the combined candidate and GT sets, whether it is present (1) or not (0).
- The F1 score is then calculated. The default 'binary' averaging method is used.
- All F1 scores are summed and averaged over the number of elements in the test set (10'000), giving the final score.
The ground truth for the test set was generated based on the UMLS Full Release 2017AB.
NOTE: The source code of the evaluation tool is available here. It must be executed using Python 3.x, on a system where the scikit-learn (>= v0.17.1-2) Python library is installed. The script should be run like this:
/path/to/python3 evaluate-f1.py /path/to/candidate/file /path/to/ground-truth/file
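The per-image scoring described above can be sketched in plain Python. This is an illustrative re-implementation, not the official evaluation script: the function names `f1_for_image` and `mean_f1` and the concept IDs in the usage note are made up for the example, and the handling of images without any true positive is an assumption.

```python
from typing import Dict, List, Set


def f1_for_image(candidate: Set[str], ground_truth: Set[str]) -> float:
    """Binary F1 between the predicted and true concept sets of one image."""
    # Binary indicator arrays over every concept seen in either set.
    all_concepts: List[str] = sorted(candidate | ground_truth)
    y_pred = [1 if c in candidate else 0 for c in all_concepts]
    y_true = [1 if c in ground_truth else 0 for c in all_concepts]

    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)

    # Assumption: an image with no true positive scores 0.0, including the
    # case where both sets are empty. The official script may differ here.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(candidates: Dict[str, Set[str]],
            ground_truth: Dict[str, Set[str]]) -> float:
    """Sum the per-image F1 scores and average over the ground-truth images."""
    scores = [f1_for_image(candidates.get(image_id, set()), gt)
              for image_id, gt in ground_truth.items()]
    return sum(scores) / len(scores)
```

For example, with a ground truth of `{"img_1": {"C0000001", "C0000002"}, "img_2": {"C0000003"}}` and predictions `{"img_1": {"C0000001"}, "img_2": {"C0000003", "C0000004"}}`, each image has precision and recall of 1 and 0.5 in some order, so both score 2/3 and the mean is 2/3.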
Caption Prediction
Evaluation is based on BLEU scores, using the following methodology and parameters:
- The default implementation of the Python NLTK (v3.2.2) (Natural Language ToolKit) BLEU scoring method is used. It is documented here and based on the original article describing the BLEU evaluation method
- A Python (3.6) script loads the candidate run file, as well as the ground truth (GT) file, and processes each candidate-GT caption pair
- Each caption is pre-processed in the following way:
  - The caption is converted to lower-case
  - All punctuation is removed and the caption is tokenized into its individual words
  - Stopwords are removed using NLTK's "english" stopword list
  - Stemming is applied using NLTK's Snowball stemmer
- The BLEU score is then calculated. Note that the caption is always considered as a single sentence, even if it actually contains several sentences. No smoothing function is used.
- All BLEU scores are summed and averaged over the number of captions (10'000), giving the final score.
NOTE: The source code of the evaluation tool is available here. It must be executed using Python 3.6.x, on a system where the NLTK (v3.2.2) Python library is installed. The script should be run like this:
/path/to/python3.6 evaluate-bleu.py /path/to/candidate/file /path/to/ground-truth/file
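The caption pre-processing steps listed above can be sketched as follows. This is a minimal illustration using only the standard library, not the official script: the stopword set `DEMO_STOPWORDS` is a tiny stand-in for NLTK's "english" list, the default stemmer is the identity function rather than the Snowball stemmer, and deleting punctuation characters outright (rather than, say, replacing them with spaces) is an assumption.

```python
import string
from typing import AbstractSet, Callable, List

# Tiny stand-in stopword list (assumption); the task uses NLTK's "english" list.
DEMO_STOPWORDS: AbstractSet[str] = frozenset(
    {"a", "an", "the", "of", "in", "is", "and", "with"}
)

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def preprocess_caption(
    caption: str,
    stopwords: AbstractSet[str] = DEMO_STOPWORDS,
    stem: Callable[[str], str] = lambda token: token,  # identity stand-in
) -> List[str]:
    """Apply the four pre-processing steps and return the resulting tokens."""
    # 1. Convert to lower-case.
    text = caption.lower()
    # 2. Remove punctuation, then tokenize on whitespace.
    tokens = text.translate(_STRIP_PUNCTUATION).split()
    # 3. Drop stopwords.
    tokens = [token for token in tokens if token not in stopwords]
    # 4. Stem each remaining token.
    return [stem(token) for token in tokens]
```

With the defaults, `preprocess_caption("Axial CT of the chest, with contrast.")` returns `['axial', 'ct', 'chest', 'contrast']`. To get closer to the described setup, one could pass `set(nltk.corpus.stopwords.words("english"))` as `stopwords` and `nltk.stem.snowball.SnowballStemmer("english").stem` as `stem`, provided NLTK and its stopword corpus are installed.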
Participant registration
Please refer to the general ImageCLEF registration instructions
Preliminary Schedule
- 16 November 2020: Registration opens
- 6 March 2021: Release of the training and validation sets
- 29 April 2021: Release of the test sets
- 30 April 2021: Registration closes
- 10 May 2021: Run submission deadline
- 17 May 2021: Release of the processed results by the task organizers
- 28 May 2021: Submission of participant papers [CEUR-WS]
- 21 May – 11 June 2021: Review process of participant papers
- 11 June 2021: Notification of acceptance
- 2 July 2021: Camera ready copy of participant papers and extended lab overviews [CEUR-WS]
- 21-24 September 2021: The CLEF Conference, Bucharest, Romania.
Submission Instructions
The submissions will be received through the crowdAI system.
Please note that each group is allowed a maximum of 10 runs per subtask.
Concept Detection
For the submission of the concept detection task we expect the following format:
- <Figure-ID>|<Concept-ID-1>;<Concept-ID-2>;<Concept-ID-n>
You need to respect the following constraints:
- The separator between the figure ID and the concepts has to be a pipe character ( | )
- The separator between the UMLS concepts has to be a semicolon (;)
- Each figure ID of the test set must be included in the submitted file exactly once (even if there are no concepts)
- The same concept cannot be specified more than once for a given figure ID
- The maximum number of concepts per image is 100
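Before submitting, a run file can be checked against the constraints above with a small script. The following is a hedged sketch, not an official validator: the function name `parse_concept_run`, the figure IDs, and the concept IDs in the usage note are hypothetical, and a figure with no concepts is assumed to be written as the figure ID followed by a pipe and nothing else.

```python
from typing import Dict, Iterable, List, Set

MAX_CONCEPTS_PER_IMAGE = 100  # limit stated in the constraints above


def parse_concept_run(lines: Iterable[str],
                      test_ids: Set[str]) -> Dict[str, List[str]]:
    """Parse run-file lines; raise ValueError on the first violated constraint."""
    run: Dict[str, List[str]] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if "|" not in line:
            raise ValueError(f"line {number}: missing pipe separator")
        figure_id, _, concept_field = line.partition("|")
        if figure_id not in test_ids:
            raise ValueError(f"line {number}: unknown figure ID {figure_id!r}")
        if figure_id in run:
            raise ValueError(f"line {number}: figure ID {figure_id!r} repeated")
        # Assumption: an empty field after the pipe means "no concepts".
        concepts = [c for c in concept_field.split(";") if c]
        if len(set(concepts)) != len(concepts):
            raise ValueError(f"line {number}: duplicate concept")
        if len(concepts) > MAX_CONCEPTS_PER_IMAGE:
            raise ValueError(f"line {number}: more than "
                             f"{MAX_CONCEPTS_PER_IMAGE} concepts")
        run[figure_id] = concepts
    missing = test_ids - set(run)
    if missing:
        raise ValueError(f"missing figure IDs: {sorted(missing)}")
    return run
```

For instance, `parse_concept_run(["img_1|C0000001;C0000002", "img_2|"], {"img_1", "img_2"})` returns `{"img_1": ["C0000001", "C0000002"], "img_2": []}`, while a line such as `img_1|C0000001;C0000001` raises a `ValueError` because the concept is repeated.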
Caption prediction
For the submission of the caption prediction task we expect the following format:
- <Figure-ID>|<description>
You need to respect the following constraints:
- The separator between the figure ID and the description has to be a pipe character ( | )
- Each figure ID of the test set must be included in the run file exactly once
- You should not include special characters in the description.
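A caption run file in this format could be written from a dictionary of predictions as sketched below. The helper name `format_caption_run` and the figure IDs in the usage note are hypothetical. Because the text does not define which characters count as special, the sketch only rejects the pipe and the newline, which would break the line format; that narrow check is an assumption, not the organizers' rule.

```python
from typing import Dict, List, Set


def format_caption_run(captions: Dict[str, str],
                       test_ids: Set[str]) -> List[str]:
    """Return one '<Figure-ID>|<description>' line per test figure."""
    # Using a dict keyed by figure ID guarantees each ID appears at most once;
    # the two set differences check that no test ID is missing or unexpected.
    missing = test_ids - set(captions)
    extra = set(captions) - test_ids
    if missing or extra:
        raise ValueError(f"missing IDs: {sorted(missing)}, "
                         f"unknown IDs: {sorted(extra)}")
    lines: List[str] = []
    for figure_id in sorted(captions):
        description = captions[figure_id]
        # Assumption: only characters that would break the line format
        # are rejected here.
        if "|" in description or "\n" in description:
            raise ValueError(f"{figure_id}: description contains '|' or a newline")
        lines.append(f"{figure_id}|{description}")
    return lines
```

Calling `format_caption_run({"img_2": "chest x ray", "img_1": "axial ct of the abdomen"}, {"img_1", "img_2"})` yields `["img_1|axial ct of the abdomen", "img_2|chest x ray"]`, which can then be written to the run file one line at a time.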
Results
Concept Detection Task
Group Name | Submission Run | F1 Score | Rank
AUEBs_NLP_Group | 136458 | 0.505 | 1
AUEB_NLP_Group | 136455 | 0.495 | 2
AUEBs_NLP_Group | 135963 | 0.493 | 3
AUEBs_NLP_Group | 136052 | 0.493 | 4
AUEBs_NLP_Group | 135847 | 0.490 | 5
NLIP-Essex-ITESM | 132945 | 0.469 | 6
AUEBs_NLP_Group | 135870 | 0.466 | 7
AUEBs_NLP_Group | 135862 | 0.459 | 8
AUEBs_NLP_Group | 136307 | 0.456 | 9
NLIP-Essex-ITESM | 136429 | 0.451 | 10
AUEBs_NLP_Group | 135989 | 0.451 | 11
NLIP-Essex-ITESM | 136404 | 0.440 | 12
NLIP-Essex-ITESM | 136400 | 0.423 | 13
ImageSem | 135873 | 0.419 | 14
NLIP-Essex-ITESM | 133912 | 0.412 | 15
ImageSem | 135871 | 0.400 | 16
ImageSem | 136142 | 0.396 | 17
ImageSem | 135858 | 0.380 | 18
ImageSem | 136129 | 0.370 | 19
IALab_PUC | 135810 | 0.360 | 20
NLIP-Essex-ITESM | 136379 | 0.355 | 21
ImageSem | 136140 | 0.355 | 22
AUEBs_NLP_Group | 136371 | 0.348 | 23
ImageSem | 136141 | 0.327 | 24
RomiBed | 136011 | 0.143 | 25
IALab_PUC | 135197 | 0.141 | 26
RomiBed | 136025 | 0.137 | 27
ImageSem | 136143 | 0.037 | 28
ImageSem | 136144 | 0.019 | 29
Caption Prediction
Group Name | Submission Run | BLEU score | Rank
IALab_PUC | 136474 | 0.50983788585763 | 1
IALab_PUC | 136419 | 0.508677997215836 | 2
AUEB_NLP_Group | 135921 | 0.46100565957904 | 3
AUEB_NLP_Group | 136370 | 0.451864143813623 | 4
AUEB_NLP_Group | 136489 | 0.447905359814873 | 5
IALab_PUC | 135736 | 0.441506218931436 | 6
AUEB_NLP_Group | 135772 | 0.440137749012683 | 7
AEHRC-CSIRO | 135507 | 0.431972275062546 | 8
AEHRC-CSIRO | 135895 | 0.430406007183828 | 9
AEHRC-CSIRO | 135049 | 0.42567134917754 | 10
AEHRC-CSIRO | 134637 | 0.422792924414946 | 11
AEHRC-CSIRO | 136097 | 0.419496101550501 | 12
AEHRC-CSIRO | 135926 | 0.415503471247303 | 13
AEHRC-CSIRO | 136228 | 0.414535967436242 | 14
AEHRC-CSIRO | 136231 | 0.405196922808508 | 15
AEHRC-CSIRO | 133707 | 0.388372250516967 | 16
IALab_PUC | 133189 | 0.377543312205065 | 17
AUEB_NLP_Group | 134819 | 0.375175109891123 | 18
IALab_PUC | 134267 | 0.369958445404712 | 19
kdelab | 134753 | 0.361634086930148 | 20
kdelab | 135513 | 0.361634086930148 | 21
kdelab | 135512 | 0.361634086930148 | 22
IALab_PUC | 136106 | 0.353575002584263 | 23
kdelab | 134435 | 0.352245261968624 | 24
IALab_PUC | 134274 | 0.350840816888247 | 25
kdelab | 134707 | 0.33880855427956 | 26
kdelab | 135467 | 0.296898143109014 | 27
kdelab | 135466 | 0.291086418426781 | 28
kdelab | 135510 | 0.287383069524119 | 29
jeanbenoit_delbrouck | 135533 | 0.285098453903958 | 30
kdelab | 135302 | 0.279629946846044 | 31
kdelab | 134433 | 0.267316681105769 | 32
ImageSem | 136138 | 0.256512698844459 | 33
jeanbenoit_delbrouck | 135448 | 0.250921036985552 | 34
jeanbenoit_delbrouck | 134878 | 0.250921036985552 | 35
RomiBed | 135896 | 0.242797351608935 | 36
ImageSem | 136135 | 0.202508814797052 | 37
AUEB_NLP_Group | 136459 | 0.198584917801623 | 38
ImageSem | 136136 | 0.180687059064707 | 39
ImageSem | 136131 | 0.136917777826475 | 40
ayushnanda14 | 136389 | 0.10291065614204 | 41
ImageSem | 136148 | 0.101744464940667 | 42
ImageSem | 136146 | 0.049288015431136 | 43
ImageSem | 136147 | 0.037992321228843 | 44
ImageSem | 135956 | 0.003577051997462 | 45
ImageSem | 135955 | 0.001089255912647 | 46
CEUR Working Notes
Citations
When referring to the ImageCLEFmed 2021 concept detection task general goals, general results, etc. please cite the following publication which will be published by September 2021:
- Obioma Pelka, Asma Ben Abacha, Alba García Seco de Herrera, Janadhip Jacutprakart, Christoph M. Friedrich, and Henning Müller. Overview of the ImageCLEFmed 2021 Concept & Caption Prediction Task, in Experimental IR Meets Multilinguality, Multimodality, and Interaction. CEUR Workshop Proceedings (CEUR-WS.org), Bucharest, Romania, September 21-24, 2021.
- BibTex:
@Inproceedings{ImageCLEFmedConceptOverview2021,
author = {Pelka, Obioma and Ben Abacha, Asma and Garc\'ia Seco de Herrera, Alba and Jacutprakart, Janadhip and Friedrich, Christoph M and M\"uller, Henning},
title = {Overview of the {ImageCLEFmed} 2021 Concept \& Caption Prediction Task},
booktitle = {CLEF2021 Working Notes},
series = {{CEUR} Workshop Proceedings},
year = {2021},
volume = {},
publisher = {CEUR-WS.org},
pages = {},
month = {September 21-24},
address = {Bucharest, Romania}
}
Contact
- Obioma Pelka <obioma.pelka(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
- Asma Ben Abacha <asma.benabacha(at)nih.gov>, National Library of Medicine, USA
- Alba García Seco de Herrera <alba.garcia(at)essex.ac.uk>, University of Essex, UK
- Janadhip Jacutprakart <j.jacutprakart(at)essex.ac.uk>, University of Essex, UK
- Christoph M. Friedrich <christoph.friedrich(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
- Henning Müller <henning.mueller(at)hevs.ch>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
Join our mailing list: https://groups.google.com/d/forum/imageclefcaption
Acknowledgments
[1] O. Pelka, S. Koitka, J. Rückert, F. Nensa, and C. M. Friedrich, "Radiology Objects in COntext (ROCO): A Multimodal Image Dataset", Proceedings of the MICCAI Workshop on Large-scale Annotation of Biomedical data and Expert Label Synthesis (MICCAI LABELS 2018), Granada, Spain, September 16, 2018, Lecture Notes in Computer Science (LNCS) Volume 11043, pp. 180-189, DOI: 10.1007/978-3-030-01364-6_20, Springer Verlag, 2018.