New Results

Towards a systematic characterization of protein complex function: a natural language processing and machine-learning framework

Varun S. Sharma, Andrea Fossati, Rodolfo Ciuffa, Marija Buljan, Evan G. Williams, Zhen Chen, Wenguang Shao, Patrick G.A. Pedrioli, Anthony W. Purcell, María Rodríguez Martínez, Jiangning Song, Matteo Manica, Ruedi Aebersold, Chen Li
doi: https://doi.org/10.1101/2021.02.24.432789
Varun S. Sharma
1Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Switzerland
2Institute for Neurodegenerative Diseases, University of California, San Francisco, CA 94143, USA
Andrea Fossati
3Quantitative Biosciences Institute (QBI) and Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA 94158, USA
4J. David Gladstone Institutes, San Francisco, CA 94158, USA
Rodolfo Ciuffa
1Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Switzerland
Marija Buljan
5Empa - Swiss Federal Laboratories for Materials Science and Technology, St. Gallen, Switzerland
Evan G. Williams
6Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Zhen Chen
7Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
Wenguang Shao
1Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Switzerland
Patrick G.A. Pedrioli
1Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Switzerland
Anthony W. Purcell
8Department of Biochemistry and Molecular Biology and Monash Biomedicine Discovery Institute, Monash University, Clayton, Victoria, Australia
María Rodríguez Martínez
9IBM Research Europe, Zürich, Switzerland
Jiangning Song
8Department of Biochemistry and Molecular Biology and Monash Biomedicine Discovery Institute, Monash University, Clayton, Victoria, Australia
10Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
  • For correspondence: Jiangning.Song@monash.edu TTE@zurich.ibm.com aebersold@imsb.biol.ethz.ch Chen.Li@monash.edu
Matteo Manica
9IBM Research Europe, Zürich, Switzerland
Ruedi Aebersold
1Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Switzerland
11Faculty of Science, University of Zürich, Switzerland
Chen Li
1Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Switzerland
8Department of Biochemistry and Molecular Biology and Monash Biomedicine Discovery Institute, Monash University, Clayton, Victoria, Australia

Summary

It is a general assumption of molecular biology that the ensemble of expressed molecules, their activities and interactions determine biological processes, cellular states and phenotypes. The quantitative abundances of transcripts, proteins and metabolites are now routinely measured with considerable depth via an array of “OMICS” technologies, and recently a number of methods have also been introduced for the parallel analysis of the abundance, subunit composition and cell state-specific changes of protein complexes. In comparison to the measurement of the molecular entities in a cell, the determination of their function remains experimentally challenging and labor-intensive. This holds particularly true for determining the function of protein complexes, which constitute the core functional assemblies of the cell. Therefore, the tremendous progress in multi-layer molecular profiling has been slow to translate into increased functional understanding of biological processes, cellular states and phenotypes. In this study we describe PCfun, a computational framework for the systematic annotation of protein complex function using Gene Ontology (GO) terms. This work builds on a word embedding (natural language text embedded into a continuous vector space that preserves semantic relationships) generated from the machine reading of 1 million open access PubMed Central articles. PCfun leverages the embedding for rapid annotation of protein complex function by integrating two approaches: (1) an unsupervised approach that obtains the nearest neighbor (NN) GO term word vectors for a protein complex query vector, and (2) a supervised approach using Random Forest (RF) models trained specifically for recovering the GO terms of protein complex queries described in the CORUM protein complex database. PCfun consolidates both approaches by performing a statistical test for the enrichment of the top NN GO terms within the child terms of the GO terms predicted by the RF models.
Thus, PCfun amalgamates information learned from the gold-standard protein-complex database, CORUM, with the unbiased predictions obtained directly from the word embedding, thereby enabling PCfun to identify the potential functions of putative protein complexes. The documentation and examples of the PCfun package are available at https://github.com/sharmavaruns/PCfun. We anticipate that PCfun will serve as a useful tool and novel paradigm for the large-scale characterization of protein complex function.
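
The unsupervised lookup and the consolidation step described above can be illustrated with a minimal sketch. It is not the PCfun implementation: the summary does not name the similarity measure or the statistical test, so cosine similarity and a one-sided hypergeometric test are assumptions made here for illustration. The function names (`nearest_go_terms`, `hypergeom_tail`), the two-dimensional vectors and the GO identifiers (`GO:A` to `GO:D`) are all hypothetical toy values, and the set of child terms stands in for the children of a GO term that an RF model would have predicted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_go_terms(query_vec, go_vectors, k):
    """Return the k GO term ids whose vectors are most similar to the query vector."""
    ranked = sorted(
        go_vectors,
        key=lambda term: cosine(query_vec, go_vectors[term]),
        reverse=True,
    )
    return ranked[:k]


def hypergeom_tail(overlap, population, successes, draws):
    """P(X >= overlap) when `draws` items are drawn without replacement
    from `population` items, of which `successes` are marked."""
    upper = min(successes, draws)
    total = math.comb(population, draws)
    tail = sum(
        math.comb(successes, i) * math.comb(population - successes, draws - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy embedding: hypothetical GO term ids mapped to 2-D vectors.
go_vectors = {
    "GO:A": [1.0, 0.0],
    "GO:B": [0.9, 0.1],
    "GO:C": [0.0, 1.0],
    "GO:D": [-1.0, 0.0],
}

# Hypothetical query vector for a protein complex name.
query = [1.0, 0.05]
top_nn = nearest_go_terms(query, go_vectors, k=2)

# Hypothetical child terms of a GO term predicted by a supervised model.
children_of_predicted = {"GO:A", "GO:B"}

# How many of the top NN terms fall among those child terms?
overlap = len(set(top_nn) & children_of_predicted)
p_value = hypergeom_tail(
    overlap=overlap,
    population=len(go_vectors),
    successes=len(children_of_predicted),
    draws=len(top_nn),
)
print(top_nn, overlap, p_value)
```

In this toy setting a small p-value would indicate that the nearest-neighbor terms cluster under the predicted term more often than chance would suggest. A real vocabulary has many thousands of GO terms, so a library routine for the hypergeometric distribution would be a more practical choice than the direct sum used here.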

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • 12 Lead contact

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted February 25, 2021.
Subject Area

  • Bioinformatics