Welcome
Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines. Consequently, there is considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. In this task, we cast the problem of image understanding as a cross-modality matching scenario in which visual content and textual descriptors need to be aligned, and concise textual interpretations of medical images are generated. We work on the basis of a large-scale collection of figures from open-access biomedical journal articles (PubMed Central). Each image is accompanied by its original caption, constituting a natural testbed for this image captioning task.
News
- 6.2.2017: Training data set is released.
- 18.10.2016: ImageCLEFcaption website goes live.
Concept Detection Task
As a first step to automatic image captioning and scene understanding, participating systems are tasked with identifying the presence of relevant biomedical concepts in medical images. Based on the visual image content, this subtask provides the building blocks for the scene understanding step by identifying the individual components from which full captions will be composed.
Caption Prediction Task
On the basis of the concept vocabulary detected in the first subtask as well as the visual information of their interaction in the image, participating systems are tasked with composing coherent captions for the entirety of an image. In this step, rather than the mere coverage of visual concepts, detecting the interplay of visible elements is crucial for recreating the original image caption.
Data
The training set for both subtasks contains 164,614 biomedical images extracted from scholarly articles on PubMed Central.
For the concept detection subtask, a file containing image ID and corresponding UMLS concepts is provided.
For the caption prediction subtask, a file containing image ID - caption pairs is provided.
Additionally, a validation set of 10,000 images is provided for both subtasks.
The test set will contain 10,000 images for both subtasks.
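The exact file layout is described in the submission instructions below. Assuming the training files use the same tab-separated <Figure-ID><TAB><value> layout, they can be loaded with a minimal sketch like the following (file names are hypothetical placeholders):

```python
# Sketch: loading the training files, assuming the tab-separated
# <Figure-ID><TAB><value> layout used by the submission formats below.
# File names are hypothetical placeholders, not the official ones.
def load_pairs(path):
    pairs = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            figure_id, _, value = line.rstrip('\n').partition('\t')
            pairs[figure_id] = value
    return pairs

captions = load_pairs('caption-training.txt')             # {figure ID: caption}
concepts = {figure_id: value.split(',') if value else []  # {figure ID: [UMLS CUIs]}
            for figure_id, value in load_pairs('concept-training.txt').items()}
```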
Evaluation methodology
Concept detection
Evaluation is conducted in terms of F1 scores between system predicted and ground truth concepts, using the following methodology and parameters:
- The default implementation of the Python scikit-learn (v0.17.1-2) F1 scoring method is used. It is documented here.
- A Python (3.x) script loads the candidate run file as well as the ground truth (GT) file and processes each pair of candidate and GT concept sets
- For each pair, the binary arrays y_pred and y_true are generated over the union of the concepts in the candidate and GT sets, indicating for each concept whether it is present (1) or not (0) in the respective set
- The F1 score is then calculated, using the default 'binary' averaging method
- All F1 scores are summed and averaged over the number of elements in the test set (10,000), giving the final score
The ground truth for the test set was generated based on the UMLS Full Release 2016AB.
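For illustration, the scoring procedure can be summarized in a short sketch. This is not the official tool; the files are assumed to follow the run-file layout from the submission instructions, and the handling of empty concept sets is an assumption:

```python
# Minimal sketch of the F1 scoring described above -- not the official tool.
# Both files are assumed to follow the submission format:
# <Figure-ID><TAB><comma-separated concept IDs>.
from sklearn.metrics import f1_score

def load_concepts(path):
    concepts = {}
    with open(path) as f:
        for line in f:
            figure_id, _, concept_str = line.strip().partition('\t')
            concepts[figure_id] = set(concept_str.split(',')) if concept_str else set()
    return concepts

candidate = load_concepts('candidate.txt')        # hypothetical file names
ground_truth = load_concepts('ground-truth.txt')

scores = []
for figure_id, gt_set in ground_truth.items():
    pred_set = candidate.get(figure_id, set())
    union = sorted(gt_set | pred_set)             # every concept seen in either set
    if not union:                                 # assumption: empty vs. empty scores 1.0
        scores.append(1.0)
        continue
    y_true = [1 if c in gt_set else 0 for c in union]
    y_pred = [1 if c in pred_set else 0 for c in union]
    scores.append(f1_score(y_true, y_pred, average='binary'))

print(sum(scores) / len(scores))                  # mean F1 over the test set
```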
NOTE: The source code of the evaluation tool is available here. It must be executed using Python 3.x on a system where the scikit-learn (>= v0.17.1-2) Python library is installed. The script is run as follows:
/path/to/python3 evaluate-f1.py /path/to/candidate/file /path/to/ground-truth/file
Caption prediction
Evaluation is based on BLEU scores, using the following methodology and parameters:
- The default implementation of the Python NLTK (v3.2.2) (Natural Language ToolKit) BLEU scoring method is used. It is documented here and is based on the original article describing the BLEU evaluation method.
- A Python (3.6) script loads the candidate run file, as well as the ground truth (GT) file, and processes each candidate-GT caption pair
- Each caption is pre-processed in the following way:
- The caption is converted to lower-case
- All punctuation is removed and the caption is tokenized into its individual words
- Stopwords are removed using NLTK's "english" stopword list
- Stemming is applied using NLTK's Snowball stemmer
- The BLEU score is then calculated. Note that the caption is always considered as a single sentence, even if it actually contains several sentences. No smoothing function is used.
- All BLEU scores are summed and averaged over the number of captions (10,000), giving the final score.
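For illustration, a minimal sketch of this scoring pipeline (not the official tool; it assumes NLTK's 'punkt' tokenizer models and 'stopwords' data are installed):

```python
# Minimal sketch of the caption scoring described above -- not the official tool.
import string
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

STOPS = set(stopwords.words('english'))
STEMMER = SnowballStemmer('english')
PUNCT = str.maketrans('', '', string.punctuation)

def preprocess(caption):
    # lower-case, remove punctuation, tokenize, drop stopwords, stem
    tokens = word_tokenize(caption.lower().translate(PUNCT))
    return [STEMMER.stem(t) for t in tokens if t not in STOPS]

def caption_bleu(candidate, reference):
    # the whole caption counts as one sentence; no smoothing function is used
    return sentence_bleu([preprocess(reference)], preprocess(candidate))

print(caption_bleu('CT scan showing a small pulmonary nodule',
                   'Chest CT scan showing a small pulmonary nodule'))
```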
NOTE: The source code of the evaluation tool is available here. It must be executed using Python 3.6.x on a system where the NLTK (v3.2.2) Python library is installed. The script is run as follows:
/path/to/python3.6 evaluate-bleu.py /path/to/candidate/file /path/to/ground-truth/file
Preliminary Schedule
- 15.11.2016: registration opens for all ImageCLEF tasks (until 22.04.2017)
- 01.02.2017: development data release starts
- 15.03.2017: test data release starts
- 05.05.2017: deadline for submission of runs by the participants
- 15.05.2017: release of processed results by the task organizers
- 26.05.2017: deadline for submission of working notes papers by the participants
- 17.06.2017: notification of acceptance of the working notes papers
- 01.07.2017: camera-ready working notes papers
- 11.-14.09.2017: CLEF 2017, Dublin, Ireland
Participant registration
Please refer to the general registration section for ImageCLEF 2017.
Submission instructions
Please note that each group is allowed a maximum of 10 runs per subtask.
Concept detection
For the submission of the concept detection task we expect the following format:
- <Figure-ID><TAB><Concept-ID-1>,<Concept-ID-2>,...,<Concept-ID-n>
e.g.:
- 1743-422X-4-12-1-4 C1,C6,C100
- 1743-422X-4-12-1-3 C89,C374
- 1743-422X-4-12-1-2 C8374
You need to respect the following constraints (a pre-submission check is sketched below):
- The separator between the figure ID and the concepts has to be a tab character
- The separator between the UMLS concepts has to be a comma (,)
- A maximum of 50 UMLS concepts per figure is accepted
- Each figure ID of the test set must be included in the run file exactly once (even if there are no concepts)
- The name of the run file has to start with "DET"
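The constraints above can be checked before submission with a small script such as this sketch (a hypothetical helper, not provided by the organizers; it assumes the test figure IDs are available in a file with one ID per line):

```python
# Hypothetical pre-submission check for a concept-detection (DET) run file.
import sys

def check_det_run(run_path, test_ids_path):
    with open(test_ids_path) as f:               # assumed: one test figure ID per line
        expected = {line.strip() for line in f if line.strip()}
    seen = set()
    with open(run_path) as f:
        for n, line in enumerate(f, 1):
            figure_id, tab, concept_str = line.rstrip('\n').partition('\t')
            assert tab == '\t', f'line {n}: missing TAB separator'
            assert figure_id not in seen, f'line {n}: duplicate figure ID'
            seen.add(figure_id)
            concepts = [c for c in concept_str.split(',') if c]  # empty list is allowed
            assert len(concepts) <= 50, f'line {n}: more than 50 concepts'
    assert seen == expected, 'every test figure ID must appear exactly once'

if __name__ == '__main__':
    check_det_run(sys.argv[1], sys.argv[2])
```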
Caption prediction
For the submission of the caption prediction task we expect the following format:
- <Figure-ID><TAB><description>
e.g.:
- 1743-422X-4-12-1-4 description of the first image in one single line
- 1743-422X-4-12-1-3 description of the second image...
- 1743-422X-4-12-1-2 description of the third image...
You need to respect the following constraints (a pre-submission check is sketched below):
- The separator between the figure ID and the description has to be a tab character
- Each figure ID of the test set must be included in the run file exactly once
- You should not include special characters in the description
- The name of the run file has to start with "PRED"
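The analogous check for a caption-prediction run file, under the same assumptions:

```python
# Hypothetical pre-submission check for a caption-prediction (PRED) run file.
import sys

def check_pred_run(run_path, test_ids_path):
    with open(test_ids_path) as f:               # assumed: one test figure ID per line
        expected = {line.strip() for line in f if line.strip()}
    seen = set()
    with open(run_path) as f:
        for n, line in enumerate(f, 1):
            figure_id, tab, caption = line.rstrip('\n').partition('\t')
            assert tab == '\t', f'line {n}: missing TAB separator'
            assert figure_id not in seen, f'line {n}: duplicate figure ID'
            seen.add(figure_id)
            # "no special characters" is read here as plain ASCII -- an assumption
            assert all(ord(ch) < 128 for ch in caption), f'line {n}: special character'
    assert seen == expected, 'every test figure ID must appear exactly once'

if __name__ == '__main__':
    check_pred_run(sys.argv[1], sys.argv[2])
```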
Results
DISCLAIMER: The results presented below have not yet been analyzed in depth and are shown "as is".
Because the groups used different kinds of resources, the results are shown in three separate rankings:
- one for runs that used no external resources
- one for runs that used external resources, where it is certain that none of the test data was included
- one for runs that used external resources which may include parts of the test data
The tables and rankings will be updated as new information on the methods used in the various runs becomes available.
Caption Prediction - No External Resources Used

| Group name | Run | Run Type | Mean BLEU score | Rank |
|---|---|---|---|---|
| NLM | 1494038340934__PRED_run_4_CNN_comb.txt | Automatic | 0.2247 | 1 |
| NLM | 1494038056289__PRED_run_3_CNN_239.txt | Automatic | 0.1384 | 2 |
| NLM | 1494037493960__PRED_run_2_CNN_92.txt | Automatic | 0.1131 | 3 |
Caption Prediction - External Resources Used, No Test Data Included

| Group name | Run | Run Type | Mean BLEU score | Rank |
|---|---|---|---|---|
| NLM | 1495446212270__PRED_X_Caption_run_1_baseline.txt | Automatic | 0.2646 | 1 |
Caption Prediction - External Resources Used, Test Data Potentially Included

| Group name | Run | Run Type | Mean BLEU score | Rank |
|---|---|---|---|---|
| NLM | 1494014231230__PRED_run_1_OpeniMethod.txt | Automatic | 0.5634 | 1 |
| NLM | 1494081858362__PRED_run_5_comb_all.txt | Automatic | 0.3317 | 2 |
Caption Prediction - Unknown

| Group name | Run | Run Type | Mean BLEU score | Rank |
|---|---|---|---|---|
| AILAB | 1493825734124__PRED_prna_run4.txt | Automatic | 0.3211 | 1 |
| AILAB | 1493824027725__PRED_prna_run1.txt | Automatic | 0.2638 | 2 |
| isia | 1493921574200__PRED test_13_svm_3_nn_dist_25_normal_noUNK | Automatic | 0.2600 | 3 |
| isia | 1493666388885__PRED test_5_svm_nn_dist_3000_nounk_modified_2 | Automatic | 0.2507 | 4 |
| isia | 1493922473076__PRED test_12_svm_3_nn_dist_25_normal | Automatic | 0.2454 | 5 |
| isia | 1494002110282__PRED test_11_svm_2_nn_dist_25_normal_noUNK | Automatic | 0.2386 | 6 |
| isia | 1493922527122__PRED test_10_svm_2_nn_dist_25_normal | Automatic | 0.2315 | 7 |
| isia | 1493831729114__PRED test_9_svm_three_nn_3000_noUNK | Automatic | 0.2240 | 9 |
| isia | 1493745561070__PRED test_6_svm_three_parts | Automatic | 0.2193 | 10 |
| isia | 1493715950351__PRED test_2_svm_two | Automatic | 0.1953 | 11 |
| isia | 1493528631975__PRED test_1_wc5sl70 | Automatic | 0.1912 | 12 |
| AILAB | 1493825504037__PRED_prna_run3.txt | Automatic | 0.1801 | 13 |
| isia | 1493831517474__PRED test_8_svm_two_remove_UNK | Automatic | 0.1684 | 14 |
| AILAB | 1493824818237__PRED_prna_run2.txt | Automatic | 0.1107 | 17 |
| BMET | 1493702564824__PRED_merge_01.txt | Automatic | 0.0982 | 18 |
| BMET | 1493698682901__PRED_3layer_998981.txt | Automatic | 0.0851 | 19 |
| BMET | 1494020619666__PRED_437805.txt | Automatic | 0.0826 | 20 |
| Biomedical Computer Science Group | 1493885614229__PRED_BCSG_Sub09.csv | Automatic | 0.0749 | 21 |
| Biomedical Computer Science Group | 1493885575289__PRED_BCSG_Sub08.csv | Automatic | 0.0675 | 22 |
| BMET | 1493701062845__PRED_1499176.txt | Automatic | 0.0656 | 23 |
| Biomedical Computer Science Group | 1493885210021__PRED_BCSG_Sub01.csv | Automatic | 0.0624 | 24 |
| Biomedical Computer Science Group | 1493885397459__PRED_BCSG_Sub04.csv | Automatic | 0.0537 | 25 |
| Biomedical Computer Science Group | 1493885352146__PRED_BCSG_Sub03.csv | Automatic | 0.0527 | 26 |
| Biomedical Computer Science Group | 1493885286358__PRED_BCSG_Sub02.csv | Automatic | 0.0411 | 27 |
| Biomedical Computer Science Group | 1493885541193__PRED_BCSG_Sub07.csv | Automatic | 0.0375 | 28 |
| Biomedical Computer Science Group | 1493885499624__PRED_BCSG_Sub06.csv | Automatic | 0.0365 | 29 |
| Biomedical Computer Science Group | 1493885708424__PRED_BCSG_Sub10.csv | Automatic | 0.0326 | 30 |
| Biomedical Computer Science Group | 1493885450000__PRED_BCSG_Sub05.csv | Automatic | 0.0200 | 31 |
Concept Detection - No External Resources Used

| Group name | Run | Run Type | Mean F1 score | Rank |
|---|---|---|---|---|
| Aegean AI Lab | 1491857120689__DET_ConceptDetectionTesting2017-results.txt | Automatic | 0.1583 | 1 |
| Information Processing Laboratory | 1494006128917__DET_LFS_PKNN_DSIFT_GBOC | Automatic | 0.1436 | 2 |
| Information Processing Laboratory | 1494006074473__DET_LFS_PKNN_CEDD4x4_DSIFT_GBOC | Automatic | 0.1418 | 3 |
| Information Processing Laboratory | 1494009510297__DET_LFS_RWR_DSIFT_GBOC | Automatic | 0.1417 | 4 |
| Information Processing Laboratory | 1494006054264__DET_LFS_PKNN_FCTH4x4_DSIFT_GBOC | Automatic | 0.1415 | 5 |
| Information Processing Laboratory | 1494009412127__DET_LFS_RWR_CEDD4x4_DSIFT_GBOC | Automatic | 0.1414 | 6 |
| Information Processing Laboratory | 1494009455073__DET_LFS_RWR_FCTH4x4_DSIFT_GBOC | Automatic | 0.1394 | 7 |
| Information Processing Laboratory | 1494006225031__DET_RWR_DSift_Top100_L2_SqrtNorm_L1Norm.txt | Automatic | 0.1365 | 8 |
| Information Processing Laboratory | 1494006181689__DET_PKNN_DSift_Top100_L2_SqrtNorm_L1Norm.txt | Automatic | 0.1364 | 9 |
| Information Processing Laboratory | 1494006414840__DET_RWR_gboc_Top100_L2_SqrtNorm_L1Norm.txt | Automatic | 0.1212 | 10 |
| Information Processing Laboratory | 1494006360623__DET_PKNN_gboc_Top100_L2_SqrtNorm_L1Norm.txt | Automatic | 0.1208 | 11 |
| MEDGIFT UPB | 1496826981029__DET_CORRECTED_medgift_baseline.txt | Automatic | 0.0893 | 12 |
| NLM | 1494013963830__DET_run_8_comb1_CNN2.txt | Automatic | 0.0880 | 13 |
| NLM | 1494014008563__DET_run_9_comb2_CNN2Meka.txt | Automatic | 0.0868 | 14 |
| NLM | 1494013621939__DET_run_6_CNN_GoogLeNet_92Cuis.txt | Automatic | 0.0811 | 15 |
| NLM | 1494013664037__DET_run_7_CNN_GoogLeNet_239Cuis.txt | Automatic | 0.0695 | 16 |
| mami | 1496127572481__DET_CORRECTED_mami_resulat.txt | Feedback or/and human assistance | 0.0462 | 17 |
| MEDGIFT UPB | 1493803509469__DET_ResNet152_SCEL_t_0.06.txt | Automatic | 0.0028 | 18 |
| NLM | 1494012725738__DET_run_5_Meka_CEDD.txt | Automatic | 0.0012 | 19 |
| mami | 1493631868847__DET_submisionlotof0.txt | Feedback or/and human assistance | 0.0000 | 20 |
Concept Detection - External Resources Used, No Test Data Included

| Group name | Run | Run Type | Mean F1 score | Rank |
|---|---|---|---|---|
| NLM | 1495446212270__DET_X_Concept_run_1_baseline.txt | Automatic | 0.0162 | 1 |
Concept Detection - External Resources Used, Test Data Potentially Included

| Group name | Run | Run Type | Mean F1 score | Rank |
|---|---|---|---|---|
| NLM | 1494012568180__DET_run_1_openI_MetaMapLite_1.txt | Automatic | 0.1718 | 1 |
| NLM | 1494012586539__DET_run_2_openI_MetaMapLite_2.txt | Automatic | 0.1648 | 2 |
| NLM | 1494014122269__DET_run_10_comb3_CNN2MekaOpenI.txt | Automatic | 0.1390 | 3 |
| NLM | 1494012605475__DET_run_3_openI_MetaMapLite_3.txt | Automatic | 0.1228 | 4 |
Concept Detection - Unknown

| Group name | Run | Run Type | Mean F1 score | Rank |
|---|---|---|---|---|
| AILAB | 1493823116836__DET_prna_run1_processed.txt | Automatic | 0.1208 | 13 |
| BMET | 1493791786709__DET_merge_01.txt | Automatic | 0.0958 | 15 |
| BMET | 1493791318971__DET_3616832.txt | Automatic | 0.0880 | 16 |
| BMET | 1493698613574__DET_958069.txt | Automatic | 0.0838 | 19 |
| Morgan CS | 1494060724020__DET_Morgan_result_concept_from_train_Kmean300_top15.csv | Manual | 0.0498 | 22 |
| BioinformaticsUA | 1493841144834__DET_0503192045.txt | Not applicable | 0.0488 | 23 |
| BioinformaticsUA | 1493995613907__DET_0504234124-0.txt | Not applicable | 0.0463 | 24 |
| Morgan CS | 1494049613114__DET_Morgan_result_concept_from_val_Kmean50_top15.csv | Not applicable | 0.0461 | 25 |
| Morgan CS | 1494048615677__DET_Morgan_result_concept_from_train_Kmean_top20.csv | Not applicable | 0.0434 | 26 |
| BioinformaticsUA | 1493976564810__DET_0505041340-0.txt | Not applicable | 0.0414 | 27 |
| Morgan CS | 1494048330426__DET_Morgan_result_concept_from_CBIR.csv | Automatic | 0.0273 | 28 |
| AILAB | 1493823633136__DET_prna_run2_processed.txt | Automatic | 0.0234 | 29 |
| AILAB | 1493823760708__DET_prna_run3_processed.txt | Automatic | 0.0215 | 30 |
Citations
- When referring to the ImageCLEFcaption 2017 task (general goals, general results, etc.), please cite the following publication, which will be published by September 2017:
Carsten Eickhoff, Immanuel Schwall, Alba García Seco de Herrera and Henning Müller. Overview of ImageCLEFcaption 2017 - the Image Caption Prediction and Concept Extraction Tasks to Understand Biomedical Images, CLEF working notes, CEUR, 2017.
- BibTeX:
@Inproceedings{ImageCLEFoverview2017,
author = {Eickhoff, Carsten and Schwall, Immanuel and Garc\'ia Seco de Herrera, Alba and M\"uller, Henning},
title = {Overview of {ImageCLEFcaption} 2017 - the Image Caption Prediction and Concept Extraction Tasks to Understand Biomedical Images},
booktitle = {CLEF2017 Working Notes},
series = {{CEUR} Workshop Proceedings},
year = {2017},
volume = {},
publisher = {CEUR-WS.org $<$http://ceur-ws.org$>$},
pages = {},
month = {September 11-14},
address = {Dublin, Ireland},
}
Contact
- Carsten Eickhoff <c.eickhoff@acm.org>, ETH Zurich, Switzerland
- Immanuel Schwall <manuel.schwall@gmail.com>, ETH Zurich, Switzerland
- Alba García Seco de Herrera <albagarcia@nih.gov>, National Library of Medicine (NLM/NIH), Bethesda, MD, USA
- Henning Müller <henning.mueller@hevs.ch>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
Join our mailing list: https://groups.google.com/d/forum/imageclefcaption
Acknowledgements