RT Journal Article
SR Electronic
T1 Bioentities: a resource for entity recognition and relationship resolution in biomedical text mining
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 225698
DO 10.1101/225698
A1 John A. Bachman
A1 Benjamin M. Gyori
A1 Peter K. Sorger
YR 2017
UL http://biorxiv.org/content/early/2017/11/27/225698.abstract
AB Background: For automated reading of scientific publications to extract useful information about molecular mechanisms, it is critical that genes, proteins and other entities be correctly associated with uniform identifiers, a process known as named entity linking or “grounding.” Correct entity identification is essential for resolving relationships between mined information, curated interaction databases, and biological datasets. The accuracy of this process is largely dependent on the availability of resources allowing computers to link commonly used abbreviations and synonyms found in literature to uniform identifiers. Results: In a task involving automated reading of ~215,000 articles using the REACH event extraction software, we found that grounding was disproportionately inaccurate for multi-protein families (e.g., “AKT”) and named complexes that involve multiple molecular entities (e.g., “NF-κB”). To address this problem, we created Bioentities, a manually curated resource defining protein families and complexes as they are commonly referred to in text. In Bioentities, the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create Bioentities, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. Bioentities also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation on a distinct corpus of ~54,000 articles showed that incorporating Bioentities into the set of databases used for grounding significantly increased grounding accuracy for families and complexes, from 15% to 71%. The hierarchical organization of entities also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins. Conclusion: Bioentities is an effective tool for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in Bioentities is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/bioentities under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.