bioRxiv
New Results

Annotation of phenotypes using ontologies: a Gold Standard for the training and evaluation of natural language processing systems

Wasila Dahdul, Prashanti Manda, Hong Cui, James P. Balhoff, T. Alexander Dececchi, Nizar Ibrahim, Hilmar Lapp, Todd Vision, Paula M. Mabee
doi: https://doi.org/10.1101/322156
Wasila Dahdul
1 University of South Dakota
Prashanti Manda
2 University of North Carolina at Greensboro
Hong Cui
3 University of Arizona
James P. Balhoff
4 University of North Carolina at Chapel Hill
T. Alexander Dececchi
1 University of South Dakota
Nizar Ibrahim
5 University of Chicago
Hilmar Lapp
6 Duke University
Todd Vision
4 University of North Carolina at Chapel Hill
Paula M. Mabee
1 University of South Dakota

Abstract

Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using formal ontologies would enable large-scale analysis of phenotypic information from diverse systems. However, considerable human effort is required to make the semantics of phenotype descriptions amenable to machine reasoning by (a) recognizing appropriate ontological terms for entities in text and (b) stringing these terms into logical statements. Most existing Natural Language Processing tools stop at entity recognition, leaving a need for tools that can assist with both aspects of the task. The recently described Semantic CharaParser aims to meet this need. We describe the first expert-curated Gold Standard corpus for ontology-based annotation of phenotypes from the systematics literature. We use it to evaluate Semantic CharaParser’s annotations and explore differences in performance between humans and machine. We use four annotation accuracy metrics that can account for both semantically identical and semantically similar matches. We found that machine-human consistency was significantly lower than inter-curator (human–human) consistency. Surprisingly, allowing curators access to external information that was not available to Semantic CharaParser did not significantly increase the similarity of their annotations to the Gold Standard, nor did it have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the Gold Standard increased after new ontology terms relevant to the input text had been added. Evaluation by the original authors of the character descriptions indicated that the Gold Standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators, and the Gold Standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.
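
The abstract does not name the four accuracy metrics, so the following is only an illustrative sketch of how a metric can credit semantically similar as well as identical matches. It assumes a Jaccard similarity over ontology superclass sets, which is one common choice and not necessarily one of the metrics used in this work. The toy ontology, its term names, and the function names are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy ontology (not taken from the paper):
# each term maps to its direct is_a parents.
TOY_PARENTS: Dict[str, Set[str]] = {
    "anatomical structure": set(),
    "bone": {"anatomical structure"},
    "fin": {"anatomical structure"},
    "dermal bone": {"bone"},
    "endochondral bone": {"bone"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term together with all of its superclasses."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def jaccard_similarity(a: str, b: str, parents: Dict[str, Set[str]]) -> float:
    """Jaccard index of the two terms' superclass sets (1.0 = identical)."""
    anc_a = ancestors(a, parents)
    anc_b = ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


if __name__ == "__main__":
    # Exact match, near match (sibling terms), and distant match.
    for machine_term in ("dermal bone", "endochondral bone", "fin"):
        score = jaccard_similarity("dermal bone", machine_term, TOY_PARENTS)
        print(f"dermal bone vs {machine_term}: {score:.2f}")
```

Under this sketch an exact match scores 1.0, while a machine annotation that picks a sibling term still receives partial credit, which is the behavior that distinguishes a similarity-aware metric from strict exact matching.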

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted May 15, 2018.
Subject Area

  • Evolutionary Biology