######################
PlantCLEF2022 datasets:
######################

Training datasets (public access on February 15, 2022):

There are two training datasets ("trusted" and "web"):

 - the "trusted" training dataset is based on a selection of more than 2.9M images covering 80k plant species, shared and collected mainly by GBIF (and EOL to a lesser extent). These images come mainly from academic sources (museums, universities, national institutions) and collaborative platforms such as iNaturalist or Pl@ntNet, implying a fairly high certainty of determination quality. Many more photographs are now available on these platforms for a few thousand species, but the number of images has been capped at around 100 per species, favouring types of views suited to the identification of plants (close-ups of flowers, fruits, leaves, trunks, ...), so as not to unbalance the classes or inflate the size of the training dataset.

 - in contrast, the "web" training dataset is based on a collection of web images provided by the search engines Google and Bing. The initial collection of several million images, however, suffers from a significant rate of species identification errors and a massive presence of duplicates and of images less suited to the visual identification of plants (herbariums, landscapes, microscopic views, ...), or even off-topic images (portrait photos of botanists, maps, graphs, other kingdoms of the living, manufactured objects, ...). The initial collection was then semi-automatically revised to drastically reduce the number of these irrelevant pictures and to maximise, as for the trusted dataset, close-ups of flowers, fruits, leaves, trunks, etc. The final "web" dataset contains about 1.1 million images covering around 57k species.

Both datasets are available here:
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/  

First, download the metadata files related to the two datasets:  
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_metadata.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_metadata.tar.gz

Here is a brief presentation of the metadata in each column: 
 - "classid": one identifier per class (species); there are 80k classids in total, representing 80k species. These classids are also used to name the subdirectories containing the images of the associated species. The classids are also the species identifiers used by GBIF.
 - "image_name": filename of the images (jpg format, resized to 600 pixels max)
 - "image_path": the concatenation of the classid and the filename
 - "species": the species name corresponding to the classid
 - "genus", "family", "order", "class": the higher levels in the taxonomy. These can be useful for training intermediate image classification models with fewer classes (e.g. a genus classifier), which could then be fine-tuned at the lower, species level
 - "manual_tag": annotation of images made by humans representing the type of view (flower, leaf, fruit, bark, etc)
 - "predicted_tag", "predicted_tag_probability": annotation of images made by a CNN dedicated to predict the type of view (flower, leaf, fruit, bark, etc)
 - "original_url": the original URL of the image; many of these links may be inactive, which is why backup URLs are provided
 - "image_backup_url": a backup URL from which the image can be downloaded
 - "source": a link to the original plant observation
 - "publisher": institution or project where the plant observation was published
 - "gbif_occurrence_id": occurrence (= plant observation) id when it was accessed through GBIF
 - "dataset_key": GBIF dataset key related to a publisher
 - "license": license
 - "aggregator": gbif or eol
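
As an illustration of how the taxonomy columns could be used to build labels for an intermediate classifier, here is a minimal Python sketch. The function names read_metadata and label_index are hypothetical, and the ";" delimiter is an assumption: check the first line of the extracted metadata file before relying on it.

```python
import csv


def read_metadata(path, delimiter=";"):
    """Load a metadata file into a list of dicts keyed by column name.

    The delimiter is an assumption, not something stated in this README.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def label_index(rows, level="species"):
    """Map each distinct name at the given taxonomic level to an integer label.

    `level` is one of the documented columns: "species", "genus",
    "family", "order" or "class".
    """
    names = sorted({row[level] for row in rows})
    return {name: index for index, name in enumerate(names)}
```

Calling label_index(rows, "genus") gives the label set for a genus classifier, and label_index(rows) gives the species labels for the fine-tuning stage.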

Important: several images can be associated with the same plant observation (or occurrence). Please consider the gbif_occurrence_id and the source fields if you plan to split the dataset into a training and a validation set. Do not apply a random split at the image level, but at the observation level, otherwise images of the same observation may end up in both sets and bias your evaluations.
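
A split at the observation level could look like the following Python sketch. It is one possible approach, not an official procedure: the helper names and the hash-based assignment are assumptions, and rows are expected to be dicts keyed by the metadata column names. Images sharing the same gbif_occurrence_id (or, when that field is empty, the same source link) always fall on the same side of the split.

```python
import hashlib


def observation_key(row):
    """Identify the observation an image belongs to.

    Prefers gbif_occurrence_id, falls back to the source link, and as a
    last resort treats the image as its own observation.
    """
    return row.get("gbif_occurrence_id") or row.get("source") or row["image_path"]


def split_by_observation(rows, val_fraction=0.1):
    """Split rows into (train, val) so that no observation is shared."""
    train, val = [], []
    for row in rows:
        digest = hashlib.sha256(observation_key(row).encode("utf-8")).hexdigest()
        # Map the first 32 bits of the hash to a number in [0, 1).
        bucket = int(digest[:8], 16) / 0x100000000
        (val if bucket < val_fraction else train).append(row)
    return train, val
```

Because the assignment depends only on the observation key, the split is reproducible across runs and machines. The actual validation share is only approximately val_fraction, since observations contain different numbers of images.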

Images are organized into subdirectories, where each subdirectory is named after a classid from the metadata file. For the "trusted" dataset there are thus 80,000 subdirectories corresponding to 80,000 classes (species), and for the "web" dataset there are 57,314 subdirectories.
There are two ways to get the images:
- option A: download the tar files directly, where each tar file contains a subset of the subdirectories. For example, the file PlantCLEF2022_web_training_images_2.tar.gz contains all images related to the classes (classid) starting with a "2".
- option B: backup image URLs are given in the "image_backup_url" column of the metadata file; you can use wget, curl, or a script of your own to download the images one by one. This can also be useful if you want to download only a subset of species for initial experiments (e.g. only species of the orchid family, or only images that are associated with GPS coordinates, etc).
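
For option B, a small script along the following lines could be used. It is a sketch under assumptions: the function names are hypothetical, rows are dicts keyed by the metadata column names, and the local layout (one subdirectory per classid) simply mirrors the layout described above. Only the Python standard library is used.

```python
import os
import urllib.request


def select_downloads(rows, out_dir, family=None):
    """Return (url, destination) pairs, optionally restricted to one family."""
    jobs = []
    for row in rows:
        if family is not None and row["family"] != family:
            continue
        url = row["image_backup_url"]
        if not url:
            continue  # nothing to download for this row
        dest = os.path.join(out_dir, str(row["classid"]), row["image_name"])
        jobs.append((url, dest))
    return jobs


def download(jobs):
    """Fetch each image one by one, skipping files that already exist."""
    for url, dest in jobs:
        if os.path.exists(dest):
            continue  # already fetched, so an interrupted run can be resumed
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        urllib.request.urlretrieve(url, dest)
```

For example, download(select_downloads(rows, "images", family="Orchidaceae")) would fetch only the orchid images into images/<classid>/<image_name>.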

For option A, here is the list of tar files:
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_1.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_2.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_3.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_4.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_5.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_6.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_7.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_8.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/trusted/PlantCLEF2022_trusted_training_images_9.tar.gz

https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_1.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_2.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_3.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_4.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_5.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_6.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_7.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_8.tar.gz
https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train/web/PlantCLEF2022_web_training_images_9.tar.gz
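
The tar file URLs listed above follow a regular pattern, so they can also be generated in a script. The Python sketch below uses hypothetical function names. The rule that an archive holds the classes whose classid starts with the archive's digit is stated above for the "web" archive number 2; applying it to every archive of both datasets is an assumption.

```python
BASE_URL = "https://lab.plantnet.org/LifeCLEF/PlantCLEF2022/train"
DATASETS = ("trusted", "web")


def archive_url(dataset, first_digit):
    """Build the URL of one image archive, e.g. ("web", 2)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if first_digit not in range(1, 10):
        raise ValueError(f"archives are numbered 1 to 9, got {first_digit!r}")
    name = f"PlantCLEF2022_{dataset}_training_images_{first_digit}.tar.gz"
    return f"{BASE_URL}/{dataset}/{name}"


def archive_url_for_classid(dataset, classid):
    """Find the archive expected to contain a class.

    Assumes archives are keyed by the first digit of the classid.
    """
    return archive_url(dataset, int(str(classid)[0]))


# The 18 archives listed above: 9 for "trusted", then 9 for "web".
ALL_ARCHIVES = [archive_url(d, n) for d in DATASETS for n in range(1, 10)]
```

This makes it possible to download only the archives needed for a chosen set of classids instead of all 18 files.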