  • Letter

Intensity-based protein identification by machine learning from a library of tandem mass spectra

Abstract

Tandem mass spectrometry (MS/MS) has emerged as a cornerstone of proteomics owing in part to robust spectral interpretation algorithms [1–6]. Widely used algorithms do not fully exploit the intensity patterns present in mass spectra. Here, we demonstrate that intensity pattern modeling improves peptide and protein identification from MS/MS spectra. We modeled fragment ion intensities using a machine-learning approach that estimates the likelihood of observed intensities given peptide and fragment attributes. From 1,000,000 spectra, we chose 27,000 with high-quality, nonredundant matches as training data. Using the same 27,000 spectra, intensity was similarly modeled with mismatched peptides. We used these two probabilistic models to compute the relative likelihood of an observed spectrum given that a candidate peptide is matched or mismatched. We used a 'decoy' proteome approach to estimate incorrect match frequency [7], and demonstrated that an intensity-based method reduces peptide identification error by 50–96% without any loss in sensitivity.
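The relative-likelihood idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not give the learned probabilities or the full set of peptide and fragment attributes, so the two probability tables, the intensity bins, and the function name `log_likelihood_ratio` below are hypothetical, and fragments are assumed to be independent for simplicity.

```python
import math

# Hypothetical tables: P(intensity bin | fragment ion type) under each model.
# The values are invented for illustration only.
MATCHED = {
    "b": {"low": 0.2, "mid": 0.3, "high": 0.5},
    "y": {"low": 0.1, "mid": 0.3, "high": 0.6},
}
MISMATCHED = {
    "b": {"low": 0.7, "mid": 0.2, "high": 0.1},
    "y": {"low": 0.7, "mid": 0.2, "high": 0.1},
}


def log_likelihood_ratio(fragments):
    """Sum log[P(bin | matched) / P(bin | mismatched)] over predicted fragments.

    `fragments` is a list of (ion_type, intensity_bin) pairs. Treating the
    fragments as independent is a simplifying assumption of this sketch.
    """
    score = 0.0
    for ion_type, intensity_bin in fragments:
        p_match = MATCHED[ion_type][intensity_bin]
        p_mismatch = MISMATCHED[ion_type][intensity_bin]
        score += math.log(p_match / p_mismatch)
    return score


# Intense predicted fragments favor the "matched" model (positive score);
# weak or absent ones favor the "mismatched" model (negative score).
strong = log_likelihood_ratio([("b", "high"), ("y", "high")])
weak = log_likelihood_ratio([("b", "low"), ("y", "low")])
```

A positive score means the observed intensities are more probable under the matched model than under the mismatched one, so candidate peptides can be ranked by this value.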

Figure 1: Probabilistic match decision tree automatically learned from fragment intensity data.
Figure 2: Example of lodi spectra.
Figure 3: Target/decoy discrimination analysis of PSM scoring methods using a S. cerevisiae proteome-wide set of MS/MS spectra searched with the SEQUEST algorithm.
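The target/decoy analysis named in the Figure 3 caption relies on estimating how often matches are incorrect. A minimal sketch of one common way to do this follows; it assumes that each decoy match above a score threshold stands in for roughly one incorrect target match, which may differ from the exact estimator used in the paper. The function name `estimated_error_rate` and the example scores are hypothetical.

```python
def estimated_error_rate(psms, threshold):
    """Estimate the fraction of incorrect target matches at a score threshold.

    `psms` is a list of (score, is_decoy) pairs, one per peptide-spectrum
    match. Assumption of this sketch: incorrect matches are equally likely
    to hit the target or the decoy proteome, so the decoy count above the
    threshold approximates the number of incorrect target matches.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


# Invented example scores: (score, is_decoy)
example_psms = [
    (5.0, False),
    (4.0, False),
    (3.0, True),
    (2.5, False),
    (1.0, True),
    (0.5, False),
]

rate_loose = estimated_error_rate(example_psms, threshold=2.0)
rate_strict = estimated_error_rate(example_psms, threshold=4.0)
```

Raising the threshold trades sensitivity for a lower estimated error rate; comparing scoring methods at the same number of accepted target matches shows which one discriminates better.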

References

  1. Pandey, A. & Mann, M. Proteomics to study genes and genomes. Nature 405, 837–846 (2000).

  2. Mann, M., Hendrickson, R.C. & Pandey, A. Analysis of proteins and proteomes by mass spectrometry. Annu. Rev. Biochem. 70, 437–473 (2001).

  3. Aebersold, R. & Goodlett, D.R. Mass spectrometry in proteomics. Chem. Rev. 101, 269–295 (2001).

  4. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).

  5. Tyers, M. & Mann, M. From genomics to proteomics. Nature 422, 193–197 (2003).

  6. Gay, S., Binz, P.A., Hochstrasser, D.F. & Appel, R.D. Peptide mass fingerprinting peak intensity prediction: extracting knowledge from spectra. Proteomics 2, 1374–1391 (2002).

  7. Peng, J., Elias, J.E., Thoreen, C.C., Licklider, L.J. & Gygi, S.P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50 (2003).

  8. Eng, J., McCormack, A. & Yates, J.R. 3rd. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).

  9. Jensen, F.V. Bayesian Networks and Decision Graphs (Springer, New York, 2001).

  10. King, O.D., Foulger, R.E., Dwight, S.S., White, J.V. & Roth, F.P. Predicting gene function from patterns of annotation. Genome Res. 13, 896–904 (2003).

  11. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1948).

  12. Papayannopoulos, I.A. The interpretation of collision-induced dissociation tandem mass spectra of peptides. Mass Spectrom. Rev. 14, 49–73 (1995).

  13. Breci, L.A., Tabb, D.L., Yates, J.R. 3rd & Wysocki, V.H. Cleavage N-terminal to proline: analysis of a database of peptide tandem mass spectra. Anal. Chem. 75, 1963–1971 (2003).

  14. Tabb, D.L. et al. Statistical characterization of ion trap tandem mass spectra from doubly charged tryptic peptides. Anal. Chem. 75, 1155–1163 (2003).

  15. Keller, A., Nesvizhskii, A.I., Kolker, E. & Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392 (2002).

  16. Florens, L. et al. A proteomic view of the Plasmodium falciparum life cycle. Nature 419, 520–526 (2002).

  17. Perkins, D., Pappin, D., Creasy, D. & Cottrell, J. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).

  18. Peng, J. & Gygi, S.P. Proteomics: the move to mixtures. J. Mass Spectrom. 36, 1083–1091 (2001).

  19. Harrison, A.G. The gas-phase basicities and proton affinities of amino acids and peptides. Mass Spectrom. Rev. 16, 201–217 (1997).

  20. Deber, C.M. et al. TM Finder: a prediction program for transmembrane protein segments using a combination of hydrophobicity and nonpolar phase helicity scales. Protein Sci. 10, 212–219 (2001).

  21. Washburn, M., Wolters, D. & Yates, J.R. 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 19, 242–247 (2001).

Acknowledgements

We thank the past and present members of the Gygi Lab and Taplin Biological Mass Spectrometry Facility for generating the spectra used in this study. We gratefully acknowledge John Cottrell of Matrix Science, for examining our test data set with the Mascot algorithm. Particular thanks to S. Gerber, J. Jebanathirajah and H. Steen for insightful comments during manuscript preparation. This work was supported in part by National Institutes of Health (NIH) HG00041 (S.P.G.), NIH National Research Service Award 5T32CA86878 from the National Cancer Institute (J.E.E.), by an institutional grant from the Howard Hughes Medical Institute (F.P.R., F.D.G.), and by National Research Service Award fellowship from NIH/National Human Genome Research Institute (O.D.K.).

Author information

Corresponding author

Correspondence to Frederick P Roth.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Elias, J., Gibbons, F., King, O. et al. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat Biotechnol 22, 214–219 (2004). https://doi.org/10.1038/nbt930

  • DOI: https://doi.org/10.1038/nbt930
