EOL Dynamic Hierarchy Data Sets
The Encyclopedia of Life (EOL, eol.org) aggregates biodiversity information from more than 400 sources and provides access to the data through taxon pages, visual query and application programming interfaces. Scientific names are essential elements of the data integration infrastructure, but their shortcomings as key identifiers are well documented (Patterson et al., 2016). Complex automated workflows and continuous manual curation are required to address idiosyncrasies of source taxonomies, variation in data quality, and conflicting taxonomic opinions.
To achieve a harmonized taxonomic view of EOL content, names from data sources are mapped to a dynamic reference hierarchy using an algorithm that leverages canonical name strings, hierarchical information (ancestry, descendants), taxonomic ranks, synonym data, and author strings. Names that cannot be associated with a reference taxon are still accessible, but their unmapped status excludes them and any associated content from certain core EOL functions.
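The kinds of evidence listed above can be combined in many ways. The following is a minimal sketch of one possible scoring approach, not the EOL algorithm itself: the `Taxon` class, the `match_score` and `map_name` functions, the weights, and the threshold are all hypothetical and chosen only for illustration. It shows how hierarchical context can separate homonyms (here the bird and plant genera that share the name Morus) and how a name with no acceptable match stays unmapped.

```python
from dataclasses import dataclass


@dataclass
class Taxon:
    canonical: str                      # canonical name string, without authorship
    rank: str = ""
    author: str = ""
    ancestors: frozenset = frozenset()  # names of higher taxa
    synonyms: frozenset = frozenset()   # other canonical names for this taxon


def match_score(source, ref):
    """Score how well a source name fits a reference taxon (illustrative weights)."""
    if source.canonical == ref.canonical:
        score = 1.0
    elif source.canonical in ref.synonyms:
        score = 0.8
    else:
        return 0.0  # no name-level agreement, so no match at all
    if source.rank and source.rank == ref.rank:
        score += 0.5
    if source.author and source.author == ref.author:
        score += 0.5
    # Shared ancestors help to tell homonyms apart.
    score += 0.25 * len(source.ancestors & ref.ancestors)
    return score


def map_name(source, reference, threshold=1.5):
    """Return the best-scoring reference taxon, or None if the name stays unmapped."""
    best = max(reference, key=lambda ref: match_score(source, ref), default=None)
    if best is None or match_score(source, best) < threshold:
        return None
    return best


# Morus is both a plant genus (mulberries) and a bird genus (gannets).
plant = Taxon("Morus", "genus", ancestors=frozenset({"Plantae", "Moraceae"}))
bird = Taxon("Morus", "genus", ancestors=frozenset({"Animalia", "Aves", "Sulidae"}))
reference = [plant, bird]

source = Taxon("Morus", "genus", ancestors=frozenset({"Aves", "Sulidae"}))
mapped = map_name(source, reference)
unmapped = map_name(Taxon("Nonexistus"), reference)  # made-up name with no match
```

In this sketch the source name maps to the bird genus because it shares two ancestors with it, while the made-up name returns `None`, which corresponds to the unmapped status described above.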
The EOL dynamic hierarchy relies heavily on the Catalogue of Life as a data source, but it customizes and extends the data set to meet requirements for EOL use cases, including:
- Navigation across EOL taxon pages
- Assembly and synthesis of information on taxon pages
- Autogeneration of natural language taxon descriptions
- Trait data queries via graphical user & application programming interfaces
- Taxonomic extrapolation of trait data
- Taxonomy-based reporting & analytics
To optimally support EOL applications, the reference hierarchy should be as complete as possible, include both living and extinct organisms, and follow the most up-to-date sources for different branches of the tree of life. It should also present a modern view of biodiversity, informed by phylogeny, but it must be simple enough to remain usable by non-specialist audiences. Meeting all of these needs inevitably requires compromises, which must be re-evaluated on an ongoing basis as the EOL reference hierarchy grows and matures.
The EOL reference hierarchy currently features 2.3 million taxa. Since the Catalogue of Life (COL) is the most comprehensive taxonomic resource available, it is well suited to serve as a reference for the bulk of the EOL hierarchy (92% of taxa). In order to optimize the hierarchy for EOL applications, we modify and augment the COL data as follows:
Grafting a new trunk: We reorganize the trunk of the hierarchy so the relationships of higher taxa reflect modern phylogenetic knowledge as much as possible, minimizing the use of paraphyletic and polyphyletic taxa like “Zoomycota” or “Chromista.” Major sources for the trunk include the Open Tree of Life project (Hinchliff et al., 2015) and recent literature, e.g., Adl et al. (2019) for microbial eukaryotes.
Replacing branches: For some groups, high-quality classifications are available that have not (yet) been integrated into COL (e.g., IOC World Bird List, World Odonata List) or that have not been updated in COL in several years (e.g., Amphibian Species of the World, International Committee on Taxonomy of Viruses). If we have access to these resources and they provide complete coverage, we prune the branch from COL and use the more up-to-date classification as the sole reference source for the branch.
Filling gaps: For some groups COL coverage is incomplete and there are no comprehensive classifications available, but we can use other sources like NCBI for microbes, WoRMS for marine organisms, and ITIS for bees and mammals to supplement COL data. We also create preliminary custom patches based on available literature, e.g., for earthworms and several insect groups.
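To make the branch replacement step concrete, here is a minimal sketch under simplifying assumptions: each hierarchy is a plain child-to-parent dictionary keyed by name, and the function names `descendants` and `replace_branch` are hypothetical. The actual EOL workflow operates on much richer records; "Oldgenus" is a made-up placeholder for a taxon that the newer classification no longer recognizes.

```python
def descendants(parent_of, root):
    """Collect every taxon below root in a child -> parent mapping."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def replace_branch(base, replacement, branch_root):
    """Prune everything below branch_root from base and graft the replacement branch."""
    pruned = descendants(base, branch_root)
    merged = {child: parent for child, parent in base.items() if child not in pruned}
    for child in descendants(replacement, branch_root):
        merged[child] = replacement[child]
    return merged


# Base hierarchy with an outdated bird branch.
base = {
    "Aves": "Chordata",
    "Mammalia": "Chordata",
    "Passeriformes": "Aves",
    "Oldgenus": "Passeriformes",
}
# Newer classification that fully covers the branch below Aves.
replacement = {
    "Passeriformes": "Aves",
    "Corvidae": "Passeriformes",
    "Corvus": "Corvidae",
}

merged = replace_branch(base, replacement, "Aves")
```

The branch root itself and everything outside the branch stay as they were in the base hierarchy, so only the subtree is swapped. This mirrors the requirement that the replacement source provide complete coverage, since nothing from the pruned branch is retained.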
The creation of each version of the EOL reference hierarchy involves a workflow with both automated and manual components. It includes the following steps:
Manual curation of the hierarchy trunk which determines the relationships among higher taxa. We use the Taxonomic Tree Tool (TTT, Ji, 2017) developed by the Chinese Academy of Sciences to compare different versions and manually edit the trunk structure.
Preprocessing of source hierarchies. A fair amount of work is required to get each data set ready for import. We use gnparser (Mozzherin et al., 2017) to derive the canonical forms of name strings and apply a series of custom scripts to prune unwanted taxa and standardize taxonomic data formats.
Automated assembly of the full hierarchy is accomplished using software developed by the Open Tree of Life project (OTL, Rees & Cranston, 2017). It provides a transparent, repeatable process for the merging of taxa from partially overlapping source taxonomies into a synthetic hierarchy, and it makes it easy to incorporate new versions as sources get updated.
Postprocessing of the synthetic hierarchy. The synthetic hierarchy is examined for remaining inconsistencies, errors are corrected, links to reference taxa in source hierarchies are added, and final adjustments are made to optimize the hierarchy for EOL use. For example, in an effort to keep the number of immediate descendants of high-ranking groups within certain limits, we move large clusters of taxa of uncertain placement to artificial “unclassified” groupings.
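The "unclassified" adjustment in the postprocessing step can be sketched as follows. This is an illustration only: the function name `group_unclassified`, the `max_children` limit, the container naming scheme, and the way uncertain placement is flagged are assumptions, not details of the EOL implementation. The genus names in the example are placeholders.

```python
def group_unclassified(children, uncertain, max_children=3):
    """Move taxa of uncertain placement into an artificial container
    when a parent has more immediate children than max_children."""
    result = {}
    for parent, kids in children.items():
        loose = [kid for kid in kids if kid in uncertain]
        if len(kids) > max_children and loose:
            container = f"unclassified {parent}"
            placed = [kid for kid in kids if kid not in uncertain]
            result[parent] = placed + [container]
            result[container] = loose
        else:
            result[parent] = list(kids)
    return result


# Parent -> immediate children; three genera have no confident placement.
children = {
    "Insecta": ["Coleoptera", "Diptera", "GenusA", "GenusB", "GenusC"],
    "Aves": ["Passeriformes", "Struthioniformes"],
}
uncertain = {"GenusA", "GenusB", "GenusC"}

tidy = group_unclassified(children, uncertain)
```

Here "Insecta" ends up with three immediate children, one of which is the artificial "unclassified Insecta" group holding the three loose genera, while "Aves" is left unchanged because it is already within the limit.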
The EOL reference hierarchy is a work in progress. It has many known flaws and is far from complete for many important groups. For example, many fossil groups are not yet well represented, and much work remains to adequately represent many microbial taxa as well as hyperdiverse invertebrate groups like mites and various groups of insects. Because the workflow described above supports frequent semi-automated updates of the EOL reference hierarchy, we can incorporate newly available taxonomic resources as they appear and accommodate community feedback about potential improvements.
Adl, S. M., et al. 2019. Revisions to the classification, nomenclature, and diversity of eukaryotes. Journal of Eukaryotic Microbiology 66:4–119. doi:10.1111/jeu.12691
Hinchliff, C. E., et al. 2015. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences 112:12764–12769. doi:10.1073/pnas.1423041112
Ji, L. 2017. Catalogue of Life, China and Taxonomic Tree Tool. Proceedings of TDWG 1:e20394. doi:10.3897/tdwgproceedings.1.20394
Mozzherin, D.Y., Myltsev, A.A., and Patterson, D.J. 2017. “gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar. BMC Bioinformatics 18:279. doi:10.1186/s12859-017-1663-3
Patterson, D., et al. 2016. Challenges with using names to link digital biodiversity information. Biodiversity Data Journal 4:e8080. doi:10.3897/BDJ.4.e8080
Rees, J. A. & Cranston, K. 2017. Automated assembly of a reference taxonomy for phylogenetic data synthesis. Biodiversity Data Journal 5:e12581. doi:10.3897/BDJ.5.e12581
Cite this page as:
Encyclopedia of Life. 2022. Encyclopedia of Life Dynamic Hierarchy Data Sets. Available from https://opendata.eol.org/organization/about/dynamic-hierarchy