bioRxiv
New Results

EXTRACT 2.0: text-mining-assisted interactive annotation of biomedical named entities and ontology terms

Evangelos Pafilis, Rūdolfs Bērziņš, Lars Juhl Jensen
doi: https://doi.org/10.1101/111088
Evangelos Pafilis
1 Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, P.O. Box 2214, 71003 Heraklion, Crete, Greece
  • For correspondence: pafilis@hcmr.gr lars.juhl.jensen@cpr.ku.dk
Rūdolfs Bērziņš
2 Cellular Network Biology Group, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen N, Denmark
Lars Juhl Jensen
2 Cellular Network Biology Group, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen N, Denmark

1 INTRODUCTION

Databases increasingly rely on text-mining tools to support the curation process. The BioCreative interactive annotation task recently evaluated several such tools and found that our tool, EXTRACT, performed favorably in terms of usability and accelerated curation by 15–25% (Wang et al., 2016).

The original version of EXTRACT was designed to support annotation of metagenomic samples with semantically controlled environmental descriptors (Pafilis et al., 2016). For this reason, it focused on named entity recognition of terms from the Environment Ontology (ENVO) (Buttigieg et al., 2016) and ontologies relevant for describing host organisms (https://www.ncbi.nlm.nih.gov/taxonomy), tissues (Placzek et al., 2017), and disease states (Kibbe et al., 2015).

2 EXPANDED SCOPE OF THE TOOL

EXTRACT 2.0 expands the scope of the tool in several new directions with the aim of making it more broadly useful.

We expanded the scope from covering only diseases to covering phenotypes in general. To this end, we complemented the existing disease dictionary with terms from the Mammalian Phenotype Ontology (MPO) (Smith and Eppig, 2012). To avoid redundancy in the dictionary, we excluded MPO terms that clashed with terms already in the disease dictionary. To improve recall, we added plural and adjective endings to the names and generated variants of the form "pronoun of noun" from names of the form "noun pronoun".
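
The dictionary preparation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, a "clash" is assumed to mean a case-insensitive name match, the plural rule is a naive English heuristic, adjective endings are omitted, and the reordering simply turns a two-word name "A B" into "B of A".

```python
from typing import Dict, Iterable, Optional, Set


def of_variant(name: str) -> Optional[str]:
    """Turn a two-word name 'A B' into 'B of A'; return None for other names."""
    words = name.split()
    if len(words) != 2:
        return None
    first, second = words
    return f"{second} of {first}"


def plural_variant(name: str) -> str:
    """Naive plural (assumption): 'es' after s/x/z/ch/sh, otherwise 's'."""
    if name.endswith(("s", "x", "z", "ch", "sh")):
        return name + "es"
    return name + "s"


def expand_names(names: Iterable[str]) -> Set[str]:
    """Collect each name together with its plural and reordered variants."""
    variants: Set[str] = set()
    for name in names:
        variants.add(name)
        variants.add(plural_variant(name))
        reordered = of_variant(name)
        if reordered is not None:
            variants.add(reordered)
    return variants


def exclude_clashes(new_terms: Dict[str, str],
                    existing_names: Iterable[str]) -> Dict[str, str]:
    """Drop new terms (identifier -> name) whose name is already in the
    existing dictionary, compared case-insensitively (assumed clash rule)."""
    taken = {n.lower() for n in existing_names}
    return {term_id: name for term_id, name in new_terms.items()
            if name.lower() not in taken}
```

For instance, `expand_names(["kidney failure"])` would yield the original name, a plural form, and the reordered form "failure of kidney". A production dictionary would need more careful morphology than this heuristic provides.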

To also cover important concepts of molecular and cellular biology, we further expanded the dictionary with Gene Ontology (GO). The names from GO were processed similarly to those from MPO to generate variants and improve recall.

In addition to adding more biomedical ontologies, we have expanded the tool with named entity recognition of molecular entities. To this end, we have included dictionaries of protein-coding and non-coding RNA (ncRNA) genes from STRING (Szklarczyk et al., 2017) and RAIN (Junge et al., 2017), respectively. We have furthermore added a dictionary of drugs and other small molecule compounds from the STITCH database (Szklarczyk et al., 2016).
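
To make the role of the typed dictionaries concrete, the following toy sketch shows dictionary-based named entity recognition over several entity types. It is only an illustration under stated assumptions, not the tagger used by EXTRACT: the function name `tag`, the entity type labels, and the example entries are hypothetical, and matching is a simple case-insensitive whole-word search.

```python
import re
from typing import Dict, Iterable, List, Tuple

Match = Tuple[int, int, str, str]  # (start, end, matched text, entity type)


def tag(text: str, dictionaries: Dict[str, Iterable[str]]) -> List[Match]:
    """Find every dictionary name in the text and label it with its type."""
    matches: List[Match] = []
    for entity_type, names in dictionaries.items():
        for name in names:
            # Require non-word characters (or text boundaries) on both sides
            # so that a name is not matched inside a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                matches.append((m.start(), m.end(), m.group(0), entity_type))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical toy dictionaries; real ones would hold many names per entity.
    toy = {"gene": ["TP53"], "chemical": ["aspirin"]}
    for match in tag("TP53 binds aspirin", toy):
        print(match)
```

Keeping one dictionary per entity type lets each match carry its type, which is what allows a single pass over a text to annotate genes, compounds, and ontology terms side by side.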

Together, these additional types of entities have made EXTRACT 2.0 potentially useful for many more tasks than just annotation of metagenomic samples. For example, it can be used to help annotate both proteins and ncRNAs with functions, processes, subcellular localization, tissue expression, and associated diseases. The tool, API, and documentation are freely accessible at http://extract.jensenlab.org.

FUNDING

The Novo Nordisk Foundation (NNF14CC0001).

REFERENCES

  1. Wang, Q., Abdul, S.S., Almeida, L. et al. (2016) Overview of the interactive task in BioCreative V. Database, 2016, baw119.
  2. Pafilis, E., Buttigieg, P.L., Ferrell, B. et al. (2016) EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. Database, 2016, baw005.
  3. Buttigieg, P.L., Pafilis, E., Lewis, S.E. et al. (2016) The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation. J. Biomed. Semant., 7, 57.
  4. Placzek, S., Schomburg, I., Chang, A. et al. (2017) BRENDA in 2017: new perspectives and new tools in BRENDA. Nucleic Acids Res., 45, D380–D388.
  5. Kibbe, W.A., Arze, C., Felix, V. et al. (2015) Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res., 43, D1071–D1078.
  6. Smith, C.L. and Eppig, J.T. (2012) The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm. Genome, 23, 654–668.
  7. Szklarczyk, D., Morris, J.H., Cook, H. et al. (2017) The STRING database in 2017: quality-controlled protein-protein association. Nucleic Acids Res., 45, D362–D368.
  8. Junge, A., Refsgaard, J.C., Garde, C. et al. (2017) RAIN: RNA–protein Association and Interaction Networks. Database, 2017, baw167.
  9. Szklarczyk, D., Santos, A., von Mering, C. et al. (2016) STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res., 44, D380–D384.
Posted February 23, 2017.