More challenges for machine-learning protein interactions

Bioinformatics. 2015 May 15;31(10):1521-5. doi: 10.1093/bioinformatics/btu857. Epub 2015 Jan 12.

Abstract

Motivation: Machine learning may be the most popular computational tool in molecular biology. Providing sustained performance estimates is challenging. The standard cross-validation protocols usually fail in biology. Park and Marcotte found that even refined protocols fail for protein-protein interactions (PPIs).

Results: Here, we sketch additional problems for the prediction of PPIs from sequence alone. First, it matters not only whether proteins A or B of a target interaction A-B are similar to proteins of training interactions (positives), but also whether A or B are similar to proteins of non-interactions (negatives). Second, training on multiple interaction partners per protein did not improve performance for new proteins (not used to train). On the contrary, a strictly non-redundant training that ignored good data slightly improved the prediction of difficult cases. Third, which prediction method appears to be best depends crucially on the sequence similarity between the test and the training set, on how many true interactions should be found, and on the expected ratio of negatives to positives. The correct assessment of performance is the most complicated task in the development of prediction methods. Our analyses suggest that PPIs square the challenge for this task.
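
The first and third points can be illustrated with a minimal Python sketch. It is not the authors' protocol. It assumes that "similar to a training protein" can be approximated by exact identity of protein identifiers, whereas a real assessment would use a sequence-similarity criterion. The function names `proteins_in`, `overlap_class` and `expected_precision`, the protein identifiers and the rates in the usage example are all hypothetical and chosen only for illustration.

```python
def proteins_in(pairs):
    """Return the set of all proteins that occur in a list of (A, B) pairs."""
    return {protein for pair in pairs for protein in pair}


def overlap_class(pair, train_pos, train_neg):
    """Describe how a test pair overlaps with the training data.

    Returns (n_pos, n_neg): how many of the two proteins in `pair` also occur
    in training positives and in training negatives, respectively (0, 1 or 2).
    Simplifying assumption: overlap means identical identifiers, which stands
    in for a proper sequence-similarity threshold.
    """
    pos_proteins = proteins_in(train_pos)
    neg_proteins = proteins_in(train_neg)
    a, b = pair
    n_pos = int(a in pos_proteins) + int(b in pos_proteins)
    n_neg = int(a in neg_proteins) + int(b in neg_proteins)
    return n_pos, n_neg


def expected_precision(tpr, fpr, neg_per_pos):
    """Precision expected at a given true/false positive rate.

    `neg_per_pos` is the assumed number of non-interacting pairs per
    interacting pair in the population the method is applied to.
    """
    true_pos = tpr                 # per positive pair
    false_pos = fpr * neg_per_pos  # per positive pair
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


# Hypothetical toy data.
train_pos = [("P1", "P2"), ("P3", "P4")]
train_neg = [("P1", "P5")]

for test_pair in [("P1", "P2"), ("P1", "P6"), ("P5", "P6"), ("P7", "P8")]:
    print(test_pair, overlap_class(test_pair, train_pos, train_neg))

# The same classifier looks very different under different class ratios.
for ratio in (1, 10, 100):
    print(ratio, round(expected_precision(0.8, 0.05, ratio), 3))
```

Reporting performance separately for each `(n_pos, n_neg)` group shows whether a method generalizes to unseen proteins or mainly recognizes proteins it has already seen, among positives or among negatives. The second loop shows why the expected ratio of negatives to positives has to be stated alongside any precision estimate.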

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence*
  • Computational Biology / methods*
  • Humans
  • Protein Interaction Mapping / methods*
  • Proteins / metabolism*
  • Saccharomyces cerevisiae Proteins / metabolism
  • Sequence Analysis, Protein

Substances

  • Proteins
  • Saccharomyces cerevisiae Proteins