New Results

Assembling a corpus of phosphoproteomic annotations using ProtMapper to normalize site information from databases and text mining

John A. Bachman, Peter K. Sorger, Benjamin M. Gyori
doi: https://doi.org/10.1101/822668
John A. Bachman
1Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA 02115
Peter K. Sorger
1Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA 02115
  • For correspondence: benjamin_gyori@hms.harvard.edu lsp-papers@hms.harvard.edu peter_sorger@hms.harvard.edu
Benjamin M. Gyori
1Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA 02115
  • For correspondence: benjamin_gyori@hms.harvard.edu lsp-papers@hms.harvard.edu peter_sorger@hms.harvard.edu

Abstract

Protein phosphorylation regulates numerous cellular processes and is highly studied in biology. However, the analysis of phosphoproteomic datasets remains challenging due to limited information on upstream regulators of phosphosites, which is fragmented across multiple curated databases and unstructured literature. When aggregating information on phosphosites from six databases and three text mining systems, we found that a substantial proportion of phosphosites were mentioned at residue positions not matching the reference sequence. These errors were often attributable to the use of residue numbers from non-canonical protein isoforms, mouse or rat proteins, or post-translationally processed proteins. Non-canonical site numbering is also prevalent in mass spectrometry datasets from large-scale efforts such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC). To address these issues, we developed ProtMapper, an open-source Python tool that automatically normalizes site positions to human protein reference sequences. We used ProtMapper coupled with the INDRA knowledge assembly system to create a corpus of 37,028 regulatory annotations for 16,332 sites – to our knowledge, the most comprehensive corpus of literature-derived information about phosphosite regulation currently available. This work highlights how automated phosphosite normalization coupled to text mining and knowledge assembly allows researchers to leverage phosphosite information that exists within the scientific literature.
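The kind of site normalization described above can be illustrated with a minimal sketch. This is not ProtMapper's implementation or API: the sequences are invented toy strings, and the names `REFERENCE`, `ALTERNATES`, `SiteMapping`, and `map_site` are hypothetical. The sketch assumes two of the error sources named in the abstract (numbering taken from a non-canonical isoform, and numbering offset by one after removal of the initiator methionine) and tries each in turn when a reported residue does not match the reference.

```python
from dataclasses import dataclass
from typing import Optional

# Toy sequences for illustration only; these are not real proteins.
REFERENCE = "MKTAYSPEELSGRTPV"           # stand-in for the canonical human sequence
ALTERNATES = {
    "isoform": "MYSPEELSGRTPV",          # toy isoform lacking reference residues 2-4
}


@dataclass
class SiteMapping:
    residue: str
    position: int                        # position as reported (1-indexed)
    valid: bool                          # True if it matches the reference as given
    mapped_position: Optional[int] = None
    description: Optional[str] = None


def residue_at(seq: str, pos: int) -> Optional[str]:
    """Return the residue at a 1-indexed position, or None if out of range."""
    return seq[pos - 1] if 1 <= pos <= len(seq) else None


def map_site(residue: str, position: int, window: int = 3) -> SiteMapping:
    # Case 1: the reported site already matches the reference sequence.
    if residue_at(REFERENCE, position) == residue:
        return SiteMapping(residue, position, True)

    # Case 2: the site matches an alternate sequence; relocate it in the
    # reference by looking for the same flanking motif (must be unique).
    for name, alt in ALTERNATES.items():
        if residue_at(alt, position) != residue:
            continue
        start = max(0, position - 1 - window)
        motif = alt[start:position + window]
        offset = position - 1 - start    # index of the site within the motif
        if REFERENCE.count(motif) == 1:
            mapped = REFERENCE.find(motif) + offset + 1
            return SiteMapping(residue, position, False, mapped,
                               f"inferred {name} numbering")

    # Case 3: numbering may omit the initiator methionine (off by one).
    if residue_at(REFERENCE, position + 1) == residue:
        return SiteMapping(residue, position, False, position + 1,
                           "inferred initiator-methionine offset")

    # No mapping found; the site is reported as invalid and unmapped.
    return SiteMapping(residue, position, False)


if __name__ == "__main__":
    for res, pos in [("S", 6), ("S", 8), ("T", 13), ("Y", 9)]:
        print(map_site(res, pos))
```

In this toy setup, a site reported as S8 in isoform numbering is relocated to S11 on the reference, while a site that matches nothing is left unmapped rather than guessed. A real tool would draw sequences from curated resources and handle orthologs and further cases that this sketch omits.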

Competing Interest Statement

PKS is a co-founder and member of the BOD of Glencoe Software, a member of the BOD for Applied Biomath, and a member of the SAB for RareCyte, NanoString and Montai Health; he holds equity in Glencoe, Applied Biomath and RareCyte. PKS is a consultant for Merck and the Sorger lab has received research funding from Novartis and Merck in the past five years. PKS declares that none of these activities have influenced the content of this manuscript. JAB is currently an employee of Google, LLC. BMG declares no outside interests.

Footnotes

  • Phosphosite statistics updated to reflect current database and text mining content as of July 2022. Inclusion of INDRA estimates of annotation reliability. Revisions to text. Change in author order.

  • https://github.com/indralab/protmapper

  • https://github.com/indralab/protmapper_paper

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted October 29, 2022.

Subject Area

  • Bioinformatics