Assigning Secondary Structure in Proteins using AI

Knowledge about protein structure assignment enriches the structural and functional understanding of proteins. Accurate and reliable structure assignment data are crucial for secondary structure prediction systems. Since the 1980s, various methods based on hydrogen bond analysis and atomic coordinate geometry, followed by machine learning, have been employed in protein structure assignment. However, the assignment process becomes challenging when atoms are missing from protein files. We develop a multi-class classifier program named DLFSA for assigning protein Secondary Structure Elements (SSE) using Convolutional Neural Networks (CNN). A fast and efficient GPU-based parallel procedure extracts fragments from protein files. The model implemented in this work is trained with a subset of protein fragments and achieves a train and test accuracy of 88.1% and 82.5%, respectively. Our model uses only Cα coordinates for secondary structure assignments. The model is also successfully tested on a few full-length proteins. Results from the fragment-based studies demonstrate the feasibility of applying deep learning solutions to structure assignment problems.


Introduction
Pauling and Corey identified the existence of regular substructures, namely α-helices (H) and β-sheets (E), in protein molecules [1]. Irregular curves connecting these regular structures are called coils (C) [2][3][4]. This three-state classification extends to a finer eight-state classification comprising 3₁₀-helices (G), α-helices (H), π-helices (I), β-strands (E), β-bridges (B), turns (T), bends (S), and others (C) [5]. Among these eight states of secondary structure, some occur rarely. Protein structure assignment is the process of associating secondary-structure information with the experimentally determined coordinates of a protein. This secondary structure information has contributed to structural and computational chemistry, viz. protein structure modelling, protein design, structure comparison, classification, and visualization. Protein structure prediction is a hard task involving many degrees of freedom. Protein secondary structure is a simpler representation of the tertiary structure. Prior knowledge about the secondary structure reduces the complexity associated with modelling and design tasks. Most protein structure modelling systems use this secondary structure information in their initial steps, as it cuts down the conformational search space substantially, thereby accelerating the whole prediction process [6][7][8]. These secondary structure prediction systems require structure assignment data that serve as ground truth for training the models.
Structure assignment was initially done manually by experts in the area using visual inspection techniques, which often created discrepancies and led to the automation of the process. Later on, scientists implemented various computational structure assignment tools. Most of these assignment programs operate on hydrogen bond and Cartesian coordinate analysis, and a few work by machine learning approaches. DSSP (Dictionary of Protein Secondary Structures) [9] and STRIDE (STructural IDEntification) are two gold standards in protein structure assignment. DSSP works on fine-grained protein structures and confirms a hydrogen bond when the electrostatic energy E between each interacting pair is less than −0.5 kcal/mol:

E = k Σ q_i q_j / r_ij    (1)
where q_i and q_j represent the charges on atoms separated by a distance r_ij, and k is the Coulomb constant. By analyzing the hydrogen bonding patterns obtained, DSSP annotates the sequence with secondary structure information. As it is practically impossible to obtain fine-grained information for all protein structures in repositories, a fast and accurate method that performs well with only Cα positions is preferable. STRIDE [10] examines phi-psi angle information along with hydrogen bond analysis data. PSSC [11] works on DSSP's output and provides a more advanced characterization of secondary structures. SECSTR [12] locates some of the π-helices not caught by the DSSP and STRIDE assignment programs. DISICL [13] focuses on main-chain dihedral angles for structure assignment. SEGNO [14] generates more reliable assignments for distorted locations in the structure. DEFINE [15], P-CURVE [16], PROSIGN [17], P-SEA [18], PALSSE [19], STICK [20], VoTAP [21] and SABA [22] use atomic coordinate information for structure assignment, wherein SABA introduces the concept of the pseudo center, an imaginary point that lies between two consecutive Cα atoms. VoTAP [21] applies Voronoi tessellation to establish contacts between residues. SKSP [23] performs structural alignment of protein pairs prior to structure assignment, and the similarities among the aligned residues are used for structural assignments. SACF [24] introduces fragment-based approaches for tackling structure assignment problems; here, assignments are done by aligning Cα fragments against template fragments. SST [25] implements a Bayesian method that attempts to maximize the joint probability of a hypothesis (regarding the secondary structure assignment of given coordinates) and the data. In addition to those mentioned above, there are several other related works [26], [27], [28], [29], [30] in this area.
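As a concrete illustration, DSSP's hydrogen-bond criterion can be sketched in Python. The text above gives only the Coulomb form of the energy; the specific partial charges (0.42 e and 0.20 e) and the conversion factor (332) used below are the values from the original DSSP program and are assumptions here, not taken from this paper.

```python
# Partial charges and conversion factor as used by DSSP (Kabsch & Sander).
Q1, Q2 = 0.42, 0.20   # partial charges on the C=O and N-H groups (electron units)
F = 332.0             # factor converting charge/distance units to kcal/mol

def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Electrostatic energy between a C=O (acceptor) and N-H (donor) group.

    Each r_* is the distance in angstroms between the named atom pair,
    summing the pairwise Coulomb terms q_i * q_j / r_ij.
    """
    return Q1 * Q2 * F * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)

def is_hydrogen_bond(r_on, r_ch, r_oh, r_cn, cutoff=-0.5):
    """A hydrogen bond is confirmed when the energy falls below -0.5 kcal/mol."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < cutoff
```

For typical alpha-helical backbone geometry the energy is strongly negative, while for well-separated groups it approaches zero, so the −0.5 kcal/mol cutoff cleanly separates the two cases.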
PCASSO [31], a machine learning-based structure assignment program, achieves significant improvements in accuracy and speed compared to other state-of-the-art methods. PCASSO applies random forest, a supervised learning approach, for feature classification. Using Cα and pseudo-center coordinates, a feature vector consisting of 258 feature elements is calculated for each residue. These features are then processed using decision trees to assign structures to the residues. The best split for each node in a tree uses 16 random features out of the 258. This method, based solely on Cα coordinates, shows high speed and accuracy compared to methods that require intensive bond calculations and coordinate geometry analysis.
Random forest secondary structure assignment (RaFoSA) [32], another Artificial Intelligence (AI) technique, learns secondary structure details using a random forest classifier and assigns each residue to one of the secondary structure classes. Features used by RaFoSA include the residue type, Cα-Cα distances, the angle between three Cα atoms, the torsional angle formed by four Cα atoms and residue-residue contacts. This method finds applicability in coarse-grained as well as all-atom protein systems.
The above studies show that the individual relationships among the 3D coordinates of a protein structure clearly distinguish the various secondary structure elements. Recent years have witnessed a considerable deposition of data in protein repositories, and this abundance of data enhances the performance of Deep Learning (DL) algorithms. DL technologies coupled with big data and GPU-accelerated computing have made a broad impact in many areas, including protein modelling and design [33][34][35][36][37]. Utilizing these sheer volumes of data combined with DL techniques, it is now possible to extract the relationships existing in the input data and divert that knowledge to structure assignment problems. Still, work on structure assignment using DL techniques is very rare, almost nonexistent. Here, a Convolutional Neural Network (CNN) based model automates the structure assignment process. As protein fragments are easier to handle than full-length proteins, and to assess the applicability of DL techniques to the structure assignment process, the paper's model works with fragment structure assignments. The model takes protein fragments from Homo sapiens labelled with their three-state secondary structure information from Protein Data Bank (PDB) files. Even though there are many direct and accurate methods for assigning secondary structure from tertiary protein structures, a DL-based solution that gives faster results and better accuracy will benefit computational biologists working on protein learning tasks.
The paper is organized as follows: Section 1 introduces the problem and describes some of the state-of-the-art methods. Section 2 outlines the DLFSA design. Section 3 explains the dataset construction and other implementation aspects. Section 4 presents the results and discussion. Section 5 concludes the work.

Methods
Deep Learning-based Fragment Structure Assignment (DLFSA) is a deep learning model that predicts the secondary structures of protein fragments from atomic coordinate representations. The model outputs three types of secondary structures, viz. helices (H), sheets (E) and coils/others (C). Since there is no standard fragment structure assignment dataset available for training such deep learning models, the first step of the design process is constructing a fragment library. A subset of these fragments is preprocessed and fed into the proposed CNN model. CNNs [38] are a class of deep neural networks suitable for learning patterns from a sequence of Cα coordinates and predicting whether the sequence forms a helix, sheet or coil. Fig. 1 shows a high-level design of the proposed work.

CNN for structure assignment
CNNs are neural networks that extract relevant features from input data points by applying several filters. CNNs [36][39][40][41] are quite powerful at capturing relationships among spatial data. A protein fragment represents a continuous sequence of coordinates in space. Each input file in the training set consists of a 9 × 4 array over which the convolution window moves to extract features. The CNN starts with two convolutional layers using a 2 × 2 kernel window, with 16 and 64 filters, respectively. A max pooling layer performs dimensionality reduction, followed by four fully connected layers. Dropout regularization prevents overfitting. The classifier model ends in three nodes with a softmax activation function (equation 2) that predicts the probability that a structure is a helix, sheet or coil. Labels were one-hot encoded, with 0, 1 and 2 representing sheet, helix and coil, respectively. Fig. 2 shows the proposed CNN model's architecture with all its layers, filters, and other parameters.
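A minimal Keras sketch of this architecture is shown below. The input shape, kernel size, filter counts (16 and 64), number of fully connected layers, and three-way softmax output follow the description above; the exact layer widths, dropout rate, and padding scheme are not stated in the text and are assumed here for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dlfsa_model():
    """Sketch of the DLFSA CNN: 9x4 fragment arrays in, 3-class softmax out."""
    model = models.Sequential([
        layers.Input(shape=(9, 4, 1)),                              # 9x4 fragment array
        layers.Conv2D(16, (2, 2), activation='relu', padding='same'),
        layers.Conv2D(64, (2, 2), activation='relu', padding='same'),
        layers.MaxPooling2D(pool_size=(2, 2)),                       # dimensionality reduction
        layers.Flatten(),
        layers.Dense(256, activation='relu'),                        # fully connected stack
        layers.Dropout(0.3),                                         # assumed dropout rate
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(3, activation='softmax'),                       # sheet / helix / coil
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

With `padding='same'`, the two convolutions preserve the 9 × 4 grid before pooling, so even the short padded fragments contribute features at every position.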
The softmax function is given by the equation:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)    (2)

A fragment is categorized as a helix in the fragment extraction process only if all of its residues take part in helix formation. The fragments are thus not approximations but exact categorizations of their secondary structure; the output (label) indicates the secondary structure of the fragment as a whole. In contrast, approximating a fragment's structure class based on its majority residue nature is common in fragment library creation algorithms [43] (e.g. if four residues out of six are involved in helix formation, the fragment is categorized as a helix).
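The softmax reduction and the strict all-residue labelling rule can be sketched as follows (`fragment_label` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis: softmax(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def fragment_label(residue_states):
    """Strict labelling: a fragment is 'H' (or 'E') only when *every* residue
    shares that state; anything mixed or other is treated as coil 'C'.
    This contrasts with majority-vote approximations used elsewhere."""
    states = set(residue_states)
    if states == {'H'}:
        return 'H'
    if states == {'E'}:
        return 'E'
    return 'C'
```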
The proposed classification method considers all fragments other than helices and sheets to be coils. Consequently, the coil fragment library contains more fragments than the others. These imbalances in the counts of helices, sheets and coils create problems while training the model, so model training chooses a subset of fragments with similar counts per class. Table 2 shows the fragment count statistics.
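One simple way to obtain per-class subsets with similar counts, as described above, is to subsample each library down to the size of the smallest class. This is a sketch; the paper does not specify the exact balancing procedure used.

```python
import random

def balance_fragments(fragments_by_class, seed=0):
    """Subsample each class to the size of the smallest one, so helices,
    sheets and coils contribute a similar number of training fragments."""
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    n = min(len(frags) for frags in fragments_by_class.values())
    return {label: rng.sample(frags, n)
            for label, frags in fragments_by_class.items()}
```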

Data preprocessing
The data set consists of 119,281 fragment files containing residue names, atom names, atomic coordinates, occupancy factors, etc. From these files, the atomic coordinates of Cα atoms (715,671 Cα coordinates in total, Table 3), along with the corresponding residue names, are extracted and stored for further processing. The preprocessing step considers only fragments of length nine, six, five and three. Fragment library construction for fragment assembly methods selects mixed-length fragments of sizes three, six, and nine as a complete set. The fragments under consideration are equalized to the maximum available length by padding zeros at the end. The next step converts each fragment into an array of dimension 9 × 4, where nine is the maximum allowable fragment length and each line in a fragment file contains a residue name along with the three coordinates of a Cα atom. The integers from 0 to 19 represent the residue names in the dataset. The final step scales the input features to [-1, 1] for better convergence of the model. Finally, the data set is randomly shuffled and split for training, testing and validation purposes. Fig. 1: Phase 3 shows the steps involved in data preprocessing. Table 3 shows the hardware configuration of the system used for fragment library creation.
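The preprocessing steps above can be sketched as follows. The particular residue-to-integer mapping, the column layout (residue index, then x, y, z), and the global min-max scaling are assumptions for illustration; the paper only states the 9 × 4 shape, the 0-19 residue encoding, the zero padding and the [-1, 1] range.

```python
import numpy as np

# Assumed encoding: the 20 standard amino acids mapped to integers 0..19.
AA = ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE',
      'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL']
AA_INDEX = {name: i for i, name in enumerate(AA)}

MAX_LEN = 9  # maximum allowable fragment length

def encode_fragment(residues, coords):
    """Turn a fragment (residue names + Calpha coordinates) into a 9x4 array:
    column 0 holds the residue index, columns 1-3 the x, y, z coordinates.
    Fragments shorter than nine residues are zero-padded at the end."""
    arr = np.zeros((MAX_LEN, 4), dtype=float)
    for i, (name, xyz) in enumerate(zip(residues, coords)):
        arr[i, 0] = AA_INDEX[name]
        arr[i, 1:] = xyz
    return arr

def scale_features(x):
    """Min-max scale all features to [-1, 1] for better convergence."""
    lo, hi = x.min(), x.max()
    return 2.0 * (x - lo) / (hi - lo) - 1.0
```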

DLFSA web portal
The fragment structure assignment model DLFSA is made available to the public through the web portal www.proteinallinfo.in. The web interface uses the Python Django [44] framework. DLFSA takes a fragment sequence as input and assigns a secondary structure to it. The fragment sequence consists of the residue names, the atomic coordinates of its Cα atoms and other parameters, as in the PDB file format. The portal displays a sample file format. The maximum allowable length of a testing fragment is nine. The predicted structure represents the fragment's structure as a whole. The library generation source codes and the model codes are made open (https://github.com/jisnava/DLFSA/).

Results and Discussion
The results and accuracy of DLFSA, compared to state-of-the-art methods, are given below.

Timing analysis
The fragment extraction algorithm, written in CUDA C++, executes in parallel on the downloaded PDB files of Homo sapiens. The algorithm, executed on a Tesla K20 GPU card, achieves significant speed enhancements over its serial counterpart. Fig. 4 shows the comparison statistics. The overall speedup achieved, T_s/T_p, is given in Table 4, where T_s is the serial execution time and T_p is the parallel execution time.

Accuracy
The proposed model uses the TensorFlow-Keras framework. The neural network is trained with a batch size of 512. The activation function used for convolution is ReLU, and softmax is used for classification. The model is tuned to a learning rate of 0.001 and achieves a train and test accuracy of 88.1% and 82.5%, respectively. Fig. 5 shows the accuracy graph and Fig. 6 plots the training and validation losses. The random selection of fragments from the libraries may account for the 5.6% difference between train and test accuracy. Since these fragments are not evenly distributed across all coordinate ranges, there is a chance of bias towards a range of coordinate values, resulting in mispredictions for some coordinates. By applying techniques like clustering, fragments can be grouped based on distance measures so that the cluster representatives cover a range of coordinate values. As there are more than 30 protein structure assignment methods available, it is practically impossible to compare the results with each of them. There are difficulties in accessing the URLs mentioned in the papers, and some of the source codes we could download are not executable due to unresolved dependencies and a lack of documentation. Hence, results from DLFSA are compared with the tools that are considered the gold standards, viz.
DSSP and STRIDE, and also with the PDB data and some recent methods. The commonly used mapping M:(HGIEBTS-) → (hhhssccc) applies the eight-state to three-state secondary structure reduction. The comparison chooses three sets of fragments for which DSSP assigns helix, sheet and coil structures, respectively. Fig. 7 tabulates the results of DLFSA program execution against DSSP, STRIDE and PDB. The details of the fragments (including fragment length, protein identifier, chain name and starting position) used for comparison with DSSP are provided in the additional materials. The comparisons with STRIDE and PDB also use a similar set of fragments. Even though the experiment uses limited-length protein fragments, its applicability extends to small proteins. For full-length protein structures, the analysis uses a combination of varying-length, non-overlapping random fragments. The method successfully predicts the secondary structure information for small proteins. Fig. 8 shows the results of applying DLFSA to proteins from Homo sapiens, and Fig. 9 shows the secondary structure assignments done on a protein (ID: 2mnq) by various programs, visualized through Chimera.
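The eight-to-three state reduction M:(HGIEBTS-) → (hhhssccc) mentioned above can be expressed directly as a lookup table:

```python
# Eight-state to three-state reduction: helices (H, G, I) -> h,
# strands/bridges (E, B) -> s, everything else (T, S, -) -> c.
EIGHT_TO_THREE = {'H': 'h', 'G': 'h', 'I': 'h',
                  'E': 's', 'B': 's',
                  'T': 'c', 'S': 'c', '-': 'c'}

def reduce_to_three_state(assignment):
    """Map an eight-state assignment string to the three-state alphabet."""
    return ''.join(EIGHT_TO_THREE[state] for state in assignment)
```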
The results obtained show that the proposed model performs well on secondary structure assignment of protein fragments, using features extracted from Cartesian coordinates. The model has been implemented for three-state secondary structures, viz. helices, sheets and coils, to enable smooth learning. The main difficulty in implementing the model for eight-state secondary structures is that the eight states do not occur with equal probability in protein structures; hence, the model would not learn consistently from such unbalanced data. The same problem also occurs in secondary structure prediction tasks when training with primary protein sequences and their eight-state secondary structure information. The fragment-based model implemented in the paper can also benefit homology modelling systems. These DL models for structure assignment (when extended to full-length, multi-domain proteins) find application in the quality assessment stages of protein structure prediction systems.

Fig. 7: Secondary structure assignments by DLFSA, DSSP, STRIDE, PDB, P-SEA, STICK, SST and PCASSO for two proteins, 5zgg (8a) and 2nmq (8b). The first row of the figure shows the primary sequence. The remaining rows represent the secondary structure assignments done by various methods. The | symbol separates the non-overlapping random fragments taken for comparison.

Conclusion
With the growing number of 3D structures in protein repositories, it is now possible to assign protein structures automatically, without human intervention. Today, state-of-the-art systems use machine learning with manually engineered feature extraction, and none of the assignment systems currently available is entirely based on Deep Learning techniques. We developed a CNN-based model to automate the protein structure assignment process. The model learns the spatial relationships among protein coordinates and utilizes them for secondary structure assignments. The model is successfully tested on protein fragments and a few full-length proteins. Assignment systems able to extract local and global features from protein structures, and to use these to guide the structure assignment process itself, are now possible with Deep Learning. When provided with sufficient data, deep learning models outperform traditional approaches in Natural Language Processing, Computer Vision, and Speech Recognition, to name a few areas. Our model's accuracy validates its applicability to secondary structure assignment problems, including challenging settings where only Cα atoms are available. The developed model shows accuracy comparable to the two gold-standard methods in the area. This experiment highlights the ability of neural networks to capture local structure from coordinate data. The current model can be extended to include eight-state secondary structures (Q8) for more precise predictions, provided enough experimental data is available. Further studies are needed to extend this technique to large, multi-domain protein structures. With increasing computational power and experimental data, further improvements on protein structure assignment tasks are expected, towards faster and more accurate solutions.

Funding
This work is part of the PhD research of V. A. Jisna at the National Institute of Technology Calicut, India. The research is funded by the Ministry of Human Resource Development, India.

Competing interests
The authors declare no competing interests.

Availability of data and material
DLFSA is made available to the public through the web portal www.proteinallinfo.in. The datasets generated and/or analysed during the current study are not publicly available due to their large size but are available from the corresponding author on reasonable request.

Code availability
The library generation source codes and the model codes are made open at https://github.com/jisnava/DLFSA/.

Authors' Contributions
V. A. Jisna (VAJ) did the conceptualization. Prayagh Madhu did the model coding. VAJ developed the web portal and wrote the manuscript. P. B. Jayaraj supervised the project. All authors read and approved the final manuscript.

Acknowledgments
The authors would like to thank the Central Computing Centre, National Institute of Technology Calicut (NITC), for providing GPU servers for this work.