Abstract
Owing to the development of new algorithms and the growth of sequence databases, it has recently become possible to build robust and informative higher-order statistical sequence models from large sets of aligned protein sequences. By disentangling direct and indirect effects, such models have proven useful for assessing phenotypic landscapes, determining protein-protein interaction sites, and predicting structures de novo. In the context of structure prediction, the sequence models are used to find pairs of residues that co-vary during evolution, and hence are likely to be in spatial proximity in the functional native protein. The accuracy of these algorithms, however, drops dramatically when the number of sequences in the alignment is small, so that the highest-ranking pairs may include a substantial number of false positive predictions. We have developed a method, termed CE-YAPP (CoEvolution-YAPP), that is based on YAPP (Yet Another Peak Processor), which has been shown to solve a similar problem in NMR spectroscopy. By simultaneously performing structure prediction and contact assignment, CE-YAPP uses structural self-consistency as a filter to remove false positive contacts. At the same time, CE-YAPP solves another problem, namely how many contacts to choose from the ordered list of covarying amino acid pairs. Our results show that CE-YAPP consistently and substantially improves contact prediction from multiple sequence alignments, in particular for proteins that are difficult targets. We further show that CE-YAPP can be integrated with many different contact prediction methods, and thus will also benefit from improvements in algorithms for sequence analysis. Finally, we show that the structures determined using CE-YAPP are in better agreement with those determined using traditional methods in structural biology.
Author summary
Homologous proteins generally have similar functions and three-dimensional structures. This in turn means that it is possible to extract structural information from a detailed analysis of a multiple sequence alignment of a protein and its homologs. In particular, it has been shown that global statistical analyses of such sequence alignments allow one to find pairs of residues that have covaried during evolution, and that such pairs are likely to be in close contact in the folded protein structure. Although these insights have led to important developments in our ability to predict protein structures, these methods generally produce many false positive contact predictions when the number of homologous sequences is not large. To deal with this issue, we have developed CE-YAPP, a method that takes a noisy set of predicted contacts as input and robustly detects many of the incorrectly predicted contacts among them. More specifically, our method performs simultaneous structure prediction and contact assignment so as to use structural self-consistency as a filter for erroneous predictions. In this way, CE-YAPP improves contact and structure predictions, and thus advances our ability to extract structural information from analyses of the evolutionary record of a protein.
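The idea of using structural self-consistency as a filter can be illustrated with a toy sketch. This is not the CE-YAPP algorithm itself, which performs structure prediction and contact assignment simultaneously; it shows only a single filtering pass against one candidate structure. The function name `filter_contacts`, the 8 Å distance cutoff, and the toy coordinates are all assumptions made for illustration and do not come from the method described above.

```python
import math

def filter_contacts(coords, contacts, cutoff=8.0):
    """Keep predicted contacts that a candidate structure satisfies.

    coords   : list of (x, y, z) tuples, one representative atom per residue
    contacts : list of (i, j) residue index pairs, e.g. top-ranked covarying pairs
    cutoff   : distance in angstroms below which a pair counts as in contact
               (8.0 is an assumed, illustrative value)
    """
    kept = []
    for i, j in contacts:
        # A predicted contact is consistent with this structure only if
        # the two residues are actually close in space.
        if math.dist(coords[i], coords[j]) <= cutoff:
            kept.append((i, j))
    return kept

# Toy candidate structure: six residues on a straight line, 3.8 angstroms apart.
toy_coords = [(3.8 * k, 0.0, 0.0) for k in range(6)]

# Hypothetical predicted contacts, some of which the structure cannot satisfy.
predicted = [(0, 1), (0, 2), (0, 5), (1, 4)]

consistent = filter_contacts(toy_coords, predicted)
```

In this toy case the pairs (0, 5) and (1, 4) are separated by 19.0 Å and 11.4 Å and are discarded, while (0, 1) and (0, 2) are retained. In an iterative scheme, the retained contacts would then be used as restraints for the next round of structure calculation.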