New Results

GOLabeler: Improving Sequence-based Large-scale Protein Function Prediction by Learning to Rank

Ronghui You, Zihan Zhang, Yi Xiong, Fengzhu Sun, Hiroshi Mamitsuka, Shangfeng Zhu
doi: https://doi.org/10.1101/145763
Ronghui You
1 School of Computer Science and Shanghai Key Lab of Intelligent Information Processing and
2 Centre for Computational System Biology, Fudan University, Shanghai 200433, China
Zihan Zhang
1 School of Computer Science and Shanghai Key Lab of Intelligent Information Processing and
2 Centre for Computational System Biology, Fudan University, Shanghai 200433, China
Yi Xiong
3 Department of Bioinformatics and Biostatistics, Shanghai Jiaotong University
Fengzhu Sun
2 Centre for Computational System Biology, Fudan University, Shanghai 200433, China
4 Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
Hiroshi Mamitsuka
5 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan
6 Department of Computer Science, Aalto University, Finland
Shangfeng Zhu
1 School of Computer Science and Shanghai Key Lab of Intelligent Information Processing and
2 Centre for Computational System Biology, Fudan University, Shanghai 200433, China
  • For correspondence: zhusf@fudan.edu.cn

Abstract

Motivation: Gene Ontology (GO) has been widely used to annotate the functions of proteins and to understand their biological roles. Currently, only <1% of the more than 70 million proteins in UniProtKB have experimental GO annotations, which makes automated function prediction (AFP) of proteins strongly necessary. AFP is a hard multi-label classification problem, because a single protein can be annotated with a widely varying number of GO terms. Most of these proteins have only sequences as input information, which underlines the importance of sequence-based AFP (SAFP, where sequences are the only input). Furthermore, homology-based SAFP tools are competitive in AFP competitions, but they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins that are already annotated. Thus, the vital and challenging problem now is to develop a method for SAFP, particularly for difficult proteins.

Methods: The key to such a method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs, and to integrate them into a predictor efficiently and effectively. We propose GOLabeler, which integrates five component classifiers trained on different features, including GO term frequency, sequence alignment, amino acid trigrams, domains and motifs, and biophysical properties, in the framework of learning to rank (LTR), a new paradigm of machine learning that is especially powerful for multi-label classification.
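The integration step described in the Methods paragraph can be illustrated with a minimal sketch. It assumes that each candidate GO term for a protein comes with one score per feature type named in the abstract, and it swaps in a plain linear ranker trained with a pairwise logistic loss, because the abstract does not say which LTR algorithm GOLabeler uses. All names (`COMPONENTS`, `train_pairwise_ranker`, `rank_terms`, `term_A`, and so on) and all numbers are hypothetical toy values, not data or code from the paper.

```python
import math
import random

# Hypothetical feature order: one score per feature type listed in the abstract.
COMPONENTS = ["go_term_frequency", "sequence_alignment", "amino_acid_trigram",
              "domains_and_motifs", "biophysical_properties"]


def score(weights, features):
    """Linear ranking score for one (protein, GO term) candidate."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(proteins, epochs=200, lr=0.1, seed=0):
    """Fit weights so that, within each protein, true GO terms outscore false ones.

    proteins: list of candidate lists; each candidate is (features, label),
    with label 1 for an annotated term and 0 otherwise.
    Minimizes the pairwise logistic loss log(1 + exp(-(s_pos - s_neg))).
    """
    rng = random.Random(seed)
    weights = [0.0] * len(COMPONENTS)
    pairs = []
    for candidates in proteins:
        pos = [f for f, y in candidates if y == 1]
        neg = [f for f, y in candidates if y == 0]
        # Pairs are only formed between terms of the same protein.
        pairs.extend((p, n) for p in pos for n in neg)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for p, n in pairs:
            diff = [a - b for a, b in zip(p, n)]
            margin = score(weights, diff)
            grad_scale = -1.0 / (1.0 + math.exp(margin))  # d(loss)/d(margin)
            weights = [w - lr * grad_scale * d for w, d in zip(weights, diff)]
    return weights


def rank_terms(weights, candidates):
    """Return candidate term ids sorted from highest to lowest ranking score."""
    ordered = sorted(candidates.items(),
                     key=lambda kv: score(weights, kv[1]), reverse=True)
    return [term for term, _ in ordered]


# Toy training set (made-up scores): two proteins, each with labeled candidates.
TRAIN = [
    [([0.9, 0.8, 0.7, 0.9, 0.6], 1), ([0.7, 0.7, 0.6, 0.8, 0.6], 1),
     ([0.2, 0.1, 0.3, 0.0, 0.4], 0), ([0.6, 0.2, 0.1, 0.1, 0.5], 0)],
    [([0.8, 0.9, 0.5, 0.7, 0.7], 1),
     ([0.3, 0.2, 0.4, 0.1, 0.3], 0), ([0.7, 0.1, 0.2, 0.2, 0.6], 0)],
]

# Candidate terms for an unseen protein, keyed by placeholder term ids.
NEW_PROTEIN = {
    "term_B": [0.7, 0.1, 0.2, 0.1, 0.5],
    "term_A": [0.8, 0.9, 0.6, 0.8, 0.7],
    "term_C": [0.1, 0.0, 0.1, 0.0, 0.2],
}

weights = train_pairwise_ranker(TRAIN)
print(rank_terms(weights, NEW_PROTEIN))
```

On this toy input the ranker places `term_A` first, then `term_B`, then `term_C`. The design point is that the ranker never sees raw sequences: it only learns how much to trust each component's score when ordering GO terms for the same protein, which is what makes a ranking formulation a natural fit for multi-label output.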

Results: Extensive experiments on large-scale datasets revealed numerous favorable aspects of GOLabeler, including a significant performance advantage over state-of-the-art AFP methods.

Contact: zhusf@fudan.edu.cn

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted June 03, 2017.

Subject Area

  • Bioinformatics