Protein secondary structure prediction using a small training set (compact model) combined with a Complex-valued neural network approach

Shamima Rashid; Saras Saraswathi; Andrzej Kloczkowski; Suresh Sundaram; Andrzej Kolinski

doi:10.1186/s12859-016-1209-0

BMC Bioinformatics. 2016 Sep 13;17(1):362. doi: 10.1186/s12859-016-1209-0.

Authors

Shamima Rashid¹, Saras Saraswathi²,³, Andrzej Kloczkowski²,⁴, Suresh Sundaram⁵, Andrzej Kolinski⁶

Affiliations

¹ School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore, 639798, Singapore.
² Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, 700 Children's Drive, Columbus, USA.
³ Sidra Medical and Research Center, Al Dafna, Doha, Qatar.
⁴ Department of Paediatrics, College of Medicine, The Ohio State University, 370 W. 9th Avenue, Columbus, USA.
⁵ School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore, 639798, Singapore. ssundaram@ntu.edu.sg.
⁶ Laboratory of Theory of Biopolymers, Faculty of Chemistry, University of Warsaw, Pasteura 1, Warsaw, 02-093, Poland.

Abstract

Background: Protein secondary structure prediction (SSP) has been an area of intense research interest. Despite advances in recent methods developed on large datasets, the estimated upper limit of accuracy is yet to be reached. Since the predictions of SSP methods are applied as input to higher-level structure prediction pipelines, even small errors may cause large perturbations in the final models. Previous works relied on cross-validation as an estimate of classifier accuracy. However, training on large numbers of protein chains compromises the classifier's ability to generalize to new sequences. This prompts a novel approach to training and an investigation into the possible structural factors that lead to poor predictions. Here, a small group of 55 proteins, termed the compact model, is selected from the CB513 dataset using a heuristics-based approach. In a prior work, all sequences were represented as probability matrices of residues adopting each of the Helix, Sheet and Coil states, based on energy calculations using the C-Alpha, C-Beta, Side-chain (CABS) algorithm. The functional relationship between the conformational energies computed with the CABS force field and the residue states is approximated using a classifier termed the Fully Complex-valued Relaxation Network (FCRN). The FCRN is trained with the compact model proteins.

Results: The performance of the compact model is compared with traditional cross-validated accuracies and blind-tested on a dataset of G Switch proteins, obtaining accuracies of ~81%. The model demonstrates better results than several techniques in the literature. A comparative case study of the worst-performing chain identifies hydrogen bond contacts that lead to Coil ⇔ Sheet misclassifications. Overall, mispredicted Coil residues have a higher propensity to participate in backbone hydrogen bonding than correctly predicted Coils.
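The two quantities discussed in the Results (overall accuracy and Coil ⇔ Sheet misclassifications) can be illustrated with a minimal Python sketch. It assumes that accuracy means the fraction of residues whose three-state label is predicted correctly, and that states are encoded with the single letters H (Helix), E (Sheet) and C (Coil). The function names and the example strings are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Assumed single-letter encoding: H = Helix, E = Sheet, C = Coil.
STATES = ("H", "E", "C")


def per_residue_accuracy(true_states, predicted_states):
    """Fraction of residues whose predicted state matches the true state."""
    if len(true_states) != len(predicted_states):
        raise ValueError("sequences must have the same length")
    if not true_states:
        raise ValueError("sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return correct / len(true_states)


def confusion_counts(true_states, predicted_states):
    """Count (true, predicted) state pairs over all residues."""
    return Counter(zip(true_states, predicted_states))


def coil_sheet_errors(true_states, predicted_states):
    """Number of residues where Coil and Sheet are confused in either direction."""
    counts = confusion_counts(true_states, predicted_states)
    return counts[("C", "E")] + counts[("E", "C")]


if __name__ == "__main__":
    # Made-up example chain, for illustration only.
    true_ss = "CCEEEECCHHHHCC"
    pred_ss = "CEEEECCCHHHHCC"
    print(per_residue_accuracy(true_ss, pred_ss))
    print(coil_sheet_errors(true_ss, pred_ss))
```

Counting Coil ⇔ Sheet confusions separately from overall accuracy makes it possible to see whether a chain's errors are concentrated in that one pair of states, which is the kind of pattern the case study describes.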

Conclusions: The implications of these findings are: (i) the choice of training proteins is important in preserving a classifier's ability to generalize and predict new sequences accurately, and (ii) SSP techniques that can distinguish backbone hydrogen bonding from side-chain or water-mediated hydrogen bonding might be needed to reduce Coil ⇔ Sheet misclassifications.

Keywords: Compact model; Complex-valued relaxation network; Efficient learning; Heuristics; Inhibitor peptides; Protein structure; Secondary structure prediction.

MeSH terms

Humans
Neural Networks, Computer*
Protein Structure, Secondary
Proteins / chemistry*

Substances

Proteins