Welcome to the 2nd edition of the Caption Task!
Motivation
Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines.
Consequently, there is a considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. In this task, we cast the problem of image understanding as a cross-modality matching scenario in which visual content and textual descriptors need to be aligned and concise textual interpretations of medical images are generated. We work on the basis of a large-scale collection of figures from open access biomedical journal articles (PubMed Central). Each image is accompanied by its original caption, constituting a natural testbed for this image captioning task.
Lessons learned: In the first edition of this task, held at CLEF 2017, participants noted a broad topical variability among training images. This year, we will further group training data into image types (e.g., radiology vs. biopsy), and task participants will build either cross-category models or category-specific ones. An additional source of uncertainty was noted in the use of external material. In this second edition of the task, we will clearly separate systems using exclusively the official training data from those that incorporate additional sources of evidence.
News
- 22.03.2018: Submission is open
- 22.03.2018: Test set is released.
- 26.02.2018: The concept C0221055 was wrongly associated with every single image. We have fixed this and provided an updated training set for download.
- 06.02.2018: Because some systems are case-insensitive, we have renamed the following images in the training set:
FigS2.jpg -> FigS2_PMC3770837.jpg
IPC-15-1-g001.jpg -> IPC-15-1-g001_PMC3663154.jpg
IPC-15-1-g002.jpg -> IPC-15-1-g002_PMC3669542.jpg
Concept Detection Task
As a first step to automatic image captioning and scene understanding, participating systems are tasked with identifying the presence and location of relevant concepts in a large corpus of medical images. Based on the visual image content, this subtask provides the building blocks for the scene understanding step by identifying the individual components from which captions are composed. Evaluation is conducted in terms of set coverage metrics such as precision, recall and combinations thereof. This task will be run on a subset of the data as manual ground truthing is required.
Caption Prediction Task
On the basis of the concept vocabulary detected in the first subtask as well as the visual information of their interaction in the image, participating systems are tasked with composing coherent captions for the entirety of an image. In this step, rather than the mere coverage of visual concepts, detecting the interplay of visible elements is crucial for strong performance. Evaluation of this second step is based on metrics such as BLEU that have been designed to be robust to variability in style and wording.
Data
The collection comprises a total of 4 million image-caption pairs that could potentially all be used for training, with a small subset being removed for testing. To focus on useful radiology/clinical images and non-compound figures, the collection will likely be reduced to around 400,000 image-caption pairs, which is still significantly larger than in 2017.
Evaluation methodology
Concept detection
Evaluation is conducted in terms of F1 scores between system predicted and ground truth concepts, using the following methodology and parameters:
- The default implementation of the Python scikit-learn (v0.17.1-2) F1 scoring method is used. It is documented here.
- A Python (3.x) script loads the candidate run file, as well as the ground truth (GT) file, and processes each pair of candidate and GT concept sets.
- For each candidate-GT pair of concept sets, the y_pred and y_true arrays are generated. They are binary arrays indicating, for each concept contained in the candidate set or the GT set, whether it is present (1) or not (0).
- The F1 score is then calculated. The default 'binary' averaging method is used.
- All F1 scores are summed and averaged over the number of elements in the test set (10'000), giving the final score.
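The steps above can be sketched in plain Python as follows. This is a minimal, dependency-free illustration and not the official evaluate-f1.py: the function names are hypothetical, the binary F1 is computed by hand rather than through scikit-learn, and scoring a figure with no concepts in either set as 1.0 is an assumption.

```python
def f1_for_figure(candidate, ground_truth):
    """Binary F1 between the candidate and GT concept sets of one figure."""
    candidate, ground_truth = set(candidate), set(ground_truth)
    # Assumption: two empty concept sets count as a perfect match.
    if not candidate and not ground_truth:
        return 1.0
    # One array position per concept seen in either set.
    all_concepts = sorted(candidate | ground_truth)
    y_true = [int(c in ground_truth) for c in all_concepts]
    y_pred = [int(c in candidate) for c in all_concepts]
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero here.
    return 2 * tp / (2 * tp + fp + fn)


def mean_f1(run, ground_truth):
    """Average the per-figure F1 over every figure in the ground truth."""
    total = sum(
        f1_for_figure(run.get(figure_id, []), concepts)
        for figure_id, concepts in ground_truth.items()
    )
    return total / len(ground_truth)


gt = {"a": ["C1", "C2"], "b": ["C3"]}
run = {"a": ["C1"], "b": ["C4"]}
score = mean_f1(run, gt)
```

In this toy case, figure "a" has one true positive and one missed concept, while figure "b" has no overlap at all, so the mean is pulled down by the second figure.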
The ground truth for the test set was generated based on the UMLS Full Release 2016AB.
NOTE: The source code of the evaluation tool is available here. It must be executed using Python 3.x, on a system where the scikit-learn (>= v0.17.1-2) Python library is installed. The script should be run like this:
/path/to/python3 evaluate-f1.py /path/to/candidate/file /path/to/ground-truth/file
Caption prediction
Evaluation is based on BLEU scores, using the following methodology and parameters:
- The default implementation of the Python NLTK (v3.2.2) (Natural Language ToolKit) BLEU scoring method is used. It is documented here and based on the original article describing the BLEU evaluation method
- A Python (3.6) script loads the candidate run file, as well as the ground truth (GT) file, and processes each candidate-GT caption pair
- Each caption is pre-processed in the following way:
  - The caption is converted to lower-case
  - All punctuation is removed and the caption is tokenized into its individual words
  - Stopwords are removed using NLTK's "english" stopword list
  - Stemming is applied using NLTK's Snowball stemmer
- The BLEU score is then calculated. Note that the caption is always considered as a single sentence, even if it actually contains several sentences. No smoothing function is used.
- All BLEU scores are summed and averaged over the number of captions (10'000), giving the final score.
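The caption pre-processing steps listed above can be illustrated with a short sketch. It is not the official evaluate-bleu.py. The function name, the tiny stopword set (a stand-in for NLTK's full "english" list) and the choice to delete punctuation rather than replace it with spaces are all assumptions. Stemming is left as a pluggable callable so the block runs without NLTK installed.

```python
import string

# Stand-in subset only; the official tool uses NLTK's "english" stopword list.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "with"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_caption(caption, stopwords=STOPWORDS, stem=lambda word: word):
    """Lower-case, strip punctuation, tokenize, drop stopwords, then stem."""
    lowered = caption.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = no_punct.split()  # whitespace tokenization (assumption)
    kept = [token for token in tokens if token not in stopwords]
    return [stem(token) for token in kept]


tokens = preprocess_caption("Axial CT of the chest, with contrast.")
```

To get closer to the described pipeline, one could pass `nltk.stem.SnowballStemmer("english").stem` as the `stem` argument and NLTK's stopword list as `stopwords`, then score the resulting token lists with NLTK's BLEU implementation.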
NOTE: The source code of the evaluation tool is available here. It must be executed using Python 3.6.x, on a system where the NLTK (v3.2.2) Python library is installed. The script should be run like this:
/path/to/python3.6 evaluate-bleu.py /path/to/candidate/file /path/to/ground-truth/file
Preliminary Schedule
- 08.11.2017: registration opens for all ImageCLEF tasks (until 27.04.2018)
- 31.01.2018: development data release starts
- 22.03.2018: test data release starts
- 01.05.2018: deadline for submitting the participants' runs
- 15.05.2018: release of the processed results by the task organizers
- 31.05.2018: deadline for submission of working notes papers by the participants
- 15.06.2018: notification of acceptance of the working notes papers
- 29.06.2018: camera ready working notes papers
- 10-14.09.2018: CLEF 2018, Avignon, France
Participant registration
Please refer to the general ImageCLEF registration instructions
Submission instructions
Please note that each group is allowed a maximum of 10 runs per subtask.
Concept detection
For the submission of the concept detection task we expect the following format:
- <Figure-ID><TAB><Concept-ID-1>;<Concept-ID-2>;<Concept-ID-n>
e.g.:
- 1743-422X-4-12-1-4 C1;C6;C100
- 1743-422X-4-12-1-3 C89;C374
- 1743-422X-4-12-1-2 C8374
You need to respect the following constraints:
- The separator between the figure ID and the concepts has to be a tab character
- The separator between the UMLS concepts has to be a semicolon (;)
- Each figure ID of the test set must be included in the run file exactly once (even if there are no concepts)
- The same concept cannot be specified more than once for a given figure ID
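Before submitting, a run file can be checked against these constraints with a small script. The following is a sketch under stated assumptions and not an official validator: the function name is hypothetical, and a figure without concepts is assumed to be written as the figure ID followed by a tab and an empty concept field.

```python
def parse_concept_run(lines, test_ids):
    """Parse concept detection run lines; raise ValueError on a violation."""
    test_ids = set(test_ids)
    run = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if "\t" not in line:
            raise ValueError(f"line {number}: missing tab separator")
        figure_id, concept_field = line.split("\t", 1)
        if figure_id not in test_ids:
            raise ValueError(f"line {number}: unknown figure ID {figure_id}")
        if figure_id in run:
            raise ValueError(f"line {number}: duplicate figure ID {figure_id}")
        # Assumption: an empty field means "no concepts for this figure".
        concepts = concept_field.split(";") if concept_field else []
        if any(not concept for concept in concepts):
            raise ValueError(f"line {number}: empty concept ID")
        if len(set(concepts)) != len(concepts):
            raise ValueError(f"line {number}: repeated concept")
        run[figure_id] = concepts
    missing = test_ids - set(run)
    if missing:
        raise ValueError(f"missing figure IDs: {sorted(missing)}")
    return run


run = parse_concept_run(["f1\tC1;C6\n", "f2\t\n"], {"f1", "f2"})
```

Raising on the first violation keeps the sketch short; collecting all problems into a list would be a reasonable alternative for large run files.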
Caption prediction
For the submission of the caption prediction task we expect the following format:
- <Figure-ID><TAB><description>
e.g.:
- 1743-422X-4-12-1-4 description of the first image in one single line
- 1743-422X-4-12-1-3 description of the second image....
- 1743-422X-4-12-1-2 description of the third image...
You need to respect the following constraints:
- The separator between the figure ID and the description has to be a tab character
- Each figure ID of the test set must be included in the run file exactly once
- You should not include special characters in the description.
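One way to produce a run file in this format is sketched below. The function name is hypothetical, and two choices are assumptions and not task rules: line breaks and tabs inside a caption are collapsed to single spaces so that each description stays on one line, and a figure without a predicted caption is written with an empty description.

```python
def format_caption_run(captions, test_ids):
    """Build run-file text with one '<Figure-ID><TAB><description>' line per ID."""
    lines = []
    for figure_id in sorted(test_ids):
        # Collapse any whitespace run (newlines, tabs) into a single space.
        description = " ".join(captions.get(figure_id, "").split())
        lines.append(f"{figure_id}\t{description}")
    return "\n".join(lines) + "\n"


text = format_caption_run({"b": "line one\nline\ttwo"}, {"a", "b"})
```

Iterating over the test set IDs, not over the predictions, guarantees that every figure ID appears exactly once in the output.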
Results
Group name | Run | Mean F1 Score
UA.PT_Bioinformatics | aae-500-o0-2018-04-30_1217 | 0.1108
UA.PT_Bioinformatics | aae-2500-merge-2018-04-30_1812 | 0.1082
UA.PT_Bioinformatics | lin-orb-500-o0-2018-04-30_1142 | 0.0978
ImageSem | run10extended_results_concept_1000_steps_25000_learningrate_0.03_batch_20 | 0.0928
ImageSem | run02extended_results-testdata | 0.0909
ImageSem | run4more1000 | 0.0907
ImageSem | run01candidate_image_test_0.005 | 0.0894
ImageSem | run05extended_results_concept_1000_top20 | 0.0828
UA.PT_Bioinformatics | faae-500-o0-2018-04-27_1744 | 0.0825
ImageSem | run06top2000_extended_results | 0.0661
UA.PT_Bioinformatics | knn-ip-aae-train-2018-04-27_1259 | 0.0569
UA.PT_Bioinformatics | knn-aae-all-2018-04-26_1233 | 0.0559
IPL | DET_IPL_CLEF2018_w_300_annot_70_gboc_200 | 0.0509
Morgan | result_concept_new | 0.0418
AILAB | results_v3 | 0.0415
IPL | DET_IPL_CLEF2018_w_300_annot_40_gboc_200 | 0.0406
AILAB | results | 0.0405
IPL | DET_IPL_CLEF2018_w_300_annot_30_gboc_200 | 0.0351
UA.PT_Bioinformatics | knn-orb-all-2018-04-24_1620 | 0.0314
IPL | DET_IPL_CLEF2018_w_200_annot_30_gboc_200 | 0.0307
UA.PT_Bioinformatics | knn-ip-faae-all-2018-04-27_1512 | 0.0280
UA.PT_Bioinformatics | knn-ip-faae-all-2018-04-27_1512 | 0.0272
IPL | DET_IPL_CLEF2018_w_200_annot_20_gboc_200 | 0.0244
IPL | DET_IPL_CLEF2018_w_200_annot_15_gboc_200 | 0.0202
IPL | DET_IPL_CLEF2018_w_100_annot_20_gboc_100 | 0.0161
AILAB | results_v3 | 0.0151
IPL | DET_IPL_CLEF2018_w_200_annot_5_gboc_200 | 0.0080
ImageSem | run03candidate_image_test_0.005douhao | 0.0001

Group name | Run | Mean BLEU score
ImageSem | run04Captionstraining | 0.2501
ImageSem | run09Captionstraining | 0.2343
ImageSem | run13Captionstraining | 0.2278
ImageSem | run19Captionstraining | 0.2271
ImageSem | run03Captionstraining | 0.2244
ImageSem | run07Captionstraining | 0.2228
ImageSem | run08Captionstraining | 0.2221
ImageSem | run06Captionstraining | 0.1963
UMMS | test_captions_output4_13_epoch | 0.1799
UMMS | test_captions_output2_12_epoch | 0.1763
Morgan | result_caption | 0.1725
UMMS | test_captions_output1 | 0.1696
UMMS | test_captions_output5_13_epoch | 0.1597
UMMS | test_captions_output3_13_epoch | 0.1428
KU Leuven | 23_test_valres_0.134779058389_out_file_greedy | 0.1376
WHU | CaptionPredictionTesting-Results-zgb | 0.0446
Citations
- When referring to the ImageCLEFcaption 2018 task general goals, general results, etc. please cite the following publication which will be published by September 2018:
Alba García Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk and Henning Müller. Overview of the ImageCLEF 2018 caption prediction tasks, CLEF working notes, CEUR, 2018.
BibTex:
@Inproceedings{ImageCLEFcaptionoverview2018,
author = {Garc\'ia Seco de Herrera, Alba and Eickhoff, Carsten and Andrearczyk, Vincent and M\"uller, Henning},
title = {Overview of the {ImageCLEF} 2018 Caption Prediction tasks},
booktitle = {CLEF2018 Working Notes},
series = {{CEUR} Workshop Proceedings},
year = {2018},
volume = {},
publisher = {CEUR-WS.org $<$http://ceur-ws.org$>$},
pages = {},
month = {September 10-14},
address = {Avignon, France},
}
When referring to the ImageCLEF 2018 task in general, please cite the following publication which will be published by September 2018:
Bogdan Ionescu, Henning Müller, Mauricio Villegas, Alba García Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Sadid A. Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Matthew Lungren, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Liting Zhou, Mathias Lux and Cathal Gurrin. Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation, Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), 2018.
BibTex:
@inproceedings{ImageCLEF18,
author = {Bogdan Ionescu and Henning M\"uller and Mauricio Villegas
and Alba Garc\'ia Seco de Herrera and Carsten Eickhoff and Vincent
Andrearczyk and Yashin Dicente Cid and Vitali Liauchuk and Vassili
Kovalev and Sadid A. Hasan and Yuan Ling and Oladimeji Farri and Joey
Liu and Matthew Lungren and Duc-Tien Dang-Nguyen and Luca Piras and
Michael Riegler and Liting Zhou and Mathias Lux and Cathal Gurrin},
title = {{Overview of ImageCLEF 2018}: Challenges, Datasets and Evaluation},
booktitle = {Experimental IR Meets Multilinguality, Multimodality, and
Interaction},
series = {Proceedings of the Ninth International Conference of the
CLEF Association (CLEF 2018)},
year = {2018},
volume = {},
publisher = {{LNCS} Lecture Notes in Computer Science, Springer},
pages = {},
month = {September 10-14},
address = {Avignon, France},
}
Contact
- Carsten Eickhoff <c.eickhoff(at)acm.org>, ETH Zurich, Switzerland
- Alba García Seco de Herrera <alba.garcia(at)essex.ac.uk>, University of Essex, UK
- Henning Müller <henning.mueller(at)hevs.ch>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
- Vincent Andrearczyk <vincent.andrearczyk(at)hevs.ch>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
Join our mailing list: https://groups.google.com/d/forum/imageclefcaption
Acknowledgements