ProSPr: Democratized Implementation of Alphafold Protein Distance Prediction Network

Deep neural networks have recently enabled spectacular progress in predicting protein structures, as demonstrated by DeepMind's winning entry with Alphafold at the latest Critical Assessment of Structure Prediction competition (CASP13). The best protein prediction pipeline leverages inter-residue distance predictions to assemble a final protein model, but this distance prediction network has not been published. Here, we make a trained implementation of this network available to the broader scientific community. We also benchmark its predictive power in the related task of contact prediction against the CASP13 contact prediction winner, TripletRes. Access to ProSPr will enable other labs to build on best-in-class protein distance predictions and to engineer superior protein reconstruction methods.


Introduction
[Figure 1: ProSPr distance prediction. The input sequence is converted into a 675-layer input vector (top left). The ProSPr CNN predicts as auxiliaries secondary structure elements for each residue from 9 DSSP classes (SS) and phi/psi torsion angles between 0 and 360 degrees (top center). Further, it predicts distance distributions between each residue pair, shown as a distance histogram for residue 50 of CASP target T1016-D1 and selected residues (top right). The maxima of the distance distributions form a distance prediction map (heatmap, bottom right); bottom left, the real distances as measured in the structure file of T1016-D1.]
The CNN, named ProSPr (Protein Structure Prediction), predicts the Cβ-Cβ distance distributions between all amino acid residues (Cα for glycine) in a given protein sequence. We trained three versions of ProSPr on sequences in the CATH S35 dataset [7] (Supplementary Note and Figure S1) with the same network architecture but different input vectors. ProSPr follows Alphafold exactly and uses as input features the sequence information, the results of multiple sequence alignments (MSA) computed with PSI-BLAST [8] and HHblits, [9] as well as a Potts model [10,11,12] calculated from the MSA. ProSPr2 omits the Potts model, and ProSPr3 only uses the sequence information as input.
The performance of these three models was tested on the CASP13 dataset for free and template-based models. The predicted distance distributions were converted into contact probabilities (distance between residues < 8 Å) and precision scores for three different classes of contacts were calculated according to the CASP assessment protocol. [13] ProSPr precision scores were directly compared to the performance of the CASP13-winning CNN TripletRes [14] and are shown in Figure 2 (Supplementary Table S1). Without being explicitly trained for this purpose, ProSPr predicts contacts for 109 tested CASP13 domains with precision comparable to TripletRes over all classes, as shown in Table 1. Table 1 shows precision scores for ProSPr contacts with a distance distribution maximum < 8 Å, and for the full set of contacts independent of distribution maxima. For high confidence predictions, with maximum < 8 Å, ProSPr is on average 2% better than TripletRes on the L/5 scores. The L/2 and L scores are not directly comparable, because the absolute number of contacts ranked for ProSPr is substantially lower when the maximum < 8 Å criterion is applied than the total number of possible contacts ranked with TripletRes. For precision comparison, the ranked probabilities of all contacts, independent of maximum, are therefore also reported. Under these conditions, we see that ProSPr is comparable to TripletRes, though on average slightly inferior. ProSPr2 results are comparable to ProSPr for short- and medium-range contact predictions but are inferior to ProSPr for long-range contact predictions. ProSPr3 is inferior to ProSPr in all categories. The performance of ProSPr2/3 was compared to TripletRes and is shown in Supplementary Figure S2. One issue with current precision reporting is that a smaller number of high confidence predictions leads to an inflation of L and L/2 scores, making model comparisons based on the precision metric alone difficult to interpret.
However, L/5 scores accurately measure the ability of a network to assign high confidence contacts, and ProSPr outperforms TripletRes by an average of 2%, which is in agreement with reports given by the Alphafold authors (https://www.youtube.com/watch?v=uQ1uVbrIv-Q). Because ProSPr is trained to predict distances, the comparison against TripletRes only serves as a proof of concept. It would be a simple task to change the ProSPr network's final layers and to train it explicitly for contact prediction, but that was not the scope of this work.
[Table 1 note: high confidence rows count contacts where the maximum of the distance probability distribution falls between 0-8 Å; all other ProSPr rows sort contacts by total probability to be between 0-8 Å.]
Next to the Python-based source code, a Docker [6] container of ProSPr is made available to enable rapid usage of the distance prediction protocol. The container includes input vectors for select CASP13 targets, three pre-trained ProSPr models, and the distance prediction function to reproduce the results reported here. In addition, the distribution includes all dependencies necessary to produce a distance prediction for arbitrary sequences. Furthermore, the training set based on the CATH database, including the MSA and Potts models, is made available (https://byu.box.com/v/ProteinStructurePrediction) to repeat the training outlined in the methods section (approximately 2 TB of data). The GitHub repository contains a training function that can be used either to improve a pretrained model or to train a modified ProSPr model for further optimization or ablation testing (full training on the CATH dataset takes ~4 weeks on a single V100 GPU). The original Alphafold protocol ensembled distance predictions over 4 separately trained models and subtracted a reference network during CASP13. A pretrained reference network is also provided that predicts distances only from sequence length and whether each residue is glycine (Supplementary Note). With time, we will make additional converged models of ProSPr and more comprehensive Docker containers available, to enable model ensembling.
The field of protein structure prediction has to tackle the challenge of protein reconstruction from geometric distance restraint distributions. During CASP13 it became apparent that converting good distance predictions into chemically sound structures is still an unsolved problem. [4] ProSPr lowers the entrance barrier for academic labs and enables the community to quickly build on top of the internal coordinate predictions to develop improved protein reconstruction protocols. Further, we anticipate applications of ProSPr to investigate the validity of evolutionary constraints as apparent from MSA, as ProSPr makes it possible to rapidly compare the effects of many single mutations on protein distances. These insights might also enable improved algorithms for in-silico drug discovery for mutated targets. In addition, we observed that ProSPr can interpolate distances between missing residues (Supplementary Figure S3), rendering it a possible tool to support protein reconstruction from low resolution x-ray or cryo-EM data. [15] In conclusion, we have demonstrated that ProSPr, a CNN based on the scarce details available for Alphafold, predicts residue-residue contacts with accuracy comparable to CASP13 winner TripletRes. ProSPr has the potential to propel protein structure prediction forward by democratizing the deep neural network and to empower directed evolution and protein reconstruction efforts.
[Figure 2: Comparison of ProSPr contacts to those of TripletRes. Contacts are colored blue, green, and yellow for short-, medium-, and long-range. Markers circle, x, and star correspond to L, L/2, and L/5.]

Overview of ProSPr Architecture
Distance predictions within ProSPr can be made by calling the distance prediction function, which consists of three steps as shown in Figure 3. Initially, an (L+32)x(L+32) profile is constructed for a sequence of length L using PSI-BLAST, HHblits, a Potts model, and adding a frame of 32 bins as padding (Supplementary Note). Second, for a set of 64x64 crops of the profile, defined by a stride parameter, an input vector with dimensions 675x64x64 is assembled. The input vector encodes the raw parameters, score, h parameters, and Frobenius norm derived from the Potts model (a total of 530 layers). Further, it contains two layers that hold the lists of residues for the crop, 42 layers for one-hot encoding of the sequence, 40 layers for a position specific substitution matrix (PSSM), 60 layers for the HHblits profile, and one layer for the sequence length. Third, the input layer is propagated through the CNN. After an initial batch norm, 1-dimensional convolution filters are applied to reshape the vector to a 128x64x64 matrix. This matrix is iterated 220 times through a residual network (RESNET) block that performs batch normalization, applies the exponential linear unit (ELU) activation function, projects down to 64x64x64 dimensions, applies batch normalization and ELU again, and then cycles through 4 different dilation filters. The dilation filters have sizes 1, 2, 4, and 8 and are applied with a padding of the same size to retain dimensionality. After a final batch norm, the matrix is projected up to 128x64x64 and an identity addition is performed. After 220 iterations, the final matrix is subject to two 1-dimensional convolutions that reshape it into the final distance and auxiliary predictions. The auxiliaries predict 8 classes of secondary structure as defined within the DSSP classifications, and the phi and psi dihedrals for each residue; the angles are binned with 10 degrees resolution between 0 and 360.
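The residual block described above can be sketched in PyTorch as follows. This is a minimal sketch under our reading of the text: class and variable names are our own, and the 3x3 kernel size of the dilated convolution is an assumption (the text only specifies the dilation sizes and matching padding).

```python
import torch
import torch.nn as nn

class ProSPrResBlock(nn.Module):
    """One residual block as described in the text: batch norm -> ELU ->
    1x1 projection down to 64 channels -> batch norm -> ELU -> dilated
    convolution -> batch norm -> 1x1 projection up to 128 channels ->
    identity addition. The dilation cycles through 1, 2, 4, 8 across
    successive blocks."""

    def __init__(self, channels=128, bottleneck=64, dilation=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        # padding equal to the dilation keeps the 64x64 spatial size
        # (3x3 kernel is our assumption)
        self.dilated = nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                                 dilation=dilation, padding=dilation)
        self.bn3 = nn.BatchNorm2d(bottleneck)
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        self.elu = nn.ELU()

    def forward(self, x):
        h = self.elu(self.bn1(x))
        h = self.down(h)
        h = self.elu(self.bn2(h))
        h = self.dilated(h)
        h = self.up(self.bn3(h))
        return x + h  # identity addition

# 220 blocks cycling through the four dilation sizes
blocks = nn.Sequential(*[ProSPrResBlock(dilation=d)
                         for _ in range(55) for d in (1, 2, 4, 8)])
```

Because every block preserves the 128x64x64 shape, the tower can be iterated any number of times without reshaping.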
Due to possible gaps in the sequence, an additional classification bin is introduced for each auxiliary prediction that represents unassignable information. The auxiliary predictions were only used for training but could yield additional insights in ProSPr applications.

Training of ProSPr
ProSPr was trained on 64x64 crops extracted from the CATH S35 dataset [7] with 26393, 1000, and 500 domains randomly selected as training, validation, and test sets, respectively (Supplementary Note). Initial weights were assigned randomly with PyTorch, the loss was calculated using cross entropy, and an Adam optimizer with a learning rate of 0.001 was used to update the weights. Total loss was calculated as the weighted sum of ten times the distance loss, the losses of two secondary structure assignments, and the losses of four torsion angle assignments. Training loss and validation loss converged after 500,000 iterations with training batch sizes of 8 (Supplementary Figure S1), which corresponds approximately to the number of total crops necessary to visit each subdomain in the training set once. The training of ProSPr2 and ProSPr3 used the same setup; only the input vectors contained different amounts of information. For ProSPr2, all layers that contained Potts information were set to zero. For ProSPr3, the PSSM and HHblits layers were also set to zero. For these networks, the training loss did not converge within 500,000 iterations (Supplementary Figure S1).
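The weighted total loss above can be sketched as follows. The tensor shapes and function names are our assumptions; the weighting (ten times the distance loss plus the six auxiliary cross-entropies) follows the text.

```python
import torch
import torch.nn.functional as F

def total_loss(dist_logits, dist_labels, aux_logits, aux_labels):
    """Weighted sum described in the text: 10x the distance cross-entropy
    plus the cross-entropies of the six auxiliary tasks (two secondary
    structure assignments and four torsion angle assignments).
    aux_logits/aux_labels are lists, one entry per auxiliary task."""
    loss = 10.0 * F.cross_entropy(dist_logits, dist_labels)
    for logits, labels in zip(aux_logits, aux_labels):
        loss = loss + F.cross_entropy(logits, labels)
    return loss

# The optimizer stated in the text would then be:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```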

Converting Distances into Contacts
As a test, the distances for 109 CASP13 domains, which were not included in the training or validation sets, were predicted and converted into contacts. Instead of using all possible 64x64 crops, a stride of 25 was used between the crops to speed up evaluation of large domains. Average contact scores improved by 1% when a stride of 1 was used for the 44 shortest domains.
The 64x64x64 distance output encodes the probability of residues i and j to have distances that are either not assignable (e.g. gap in sequence), in the range of 2.3-22 Å with 0.3 Å resolution between classes, or greater than 22 Å. If the maximum of the probability distribution fell between 2.3-8 Å (bins 1-19), we considered two residues in contact for the high confidence predictions. In all cases, contacts were ranked according to the summed probability of distances between 2.3 and 8 Å, and the top L, L/2, and L/5 (L is the length of the sequence) contacts were selected to calculate accuracy scores. The contacts were classified based on the sequence separation of residues i and j into short-range (6 ≤ |i-j| ≤ 11), medium-range (12 ≤ |i-j| ≤ 23), and long-range (|i-j| ≥ 24) contacts.
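The conversion from distance distributions to contact probabilities and CASP range classes can be sketched as follows. The layout of the probability tensor (class axis first, bin 0 the unassignable class) is our assumption; the bin range 1-19 for 2.3-8 Å follows the text.

```python
import numpy as np

def contact_probability(dist_probs):
    """Sum the probability mass of the distance bins covering 2.3-8 Å
    (bins 1-19, per the text) for every residue pair. dist_probs is
    assumed to be a (64, L, L) array of per-pair class probabilities."""
    return dist_probs[1:20].sum(axis=0)

def contact_range(i, j):
    """CASP sequence-separation class of a residue pair."""
    sep = abs(i - j)
    if 6 <= sep <= 11:
        return "short"
    if 12 <= sep <= 23:
        return "medium"
    if sep >= 24:
        return "long"
    return None  # pairs closer than 6 in sequence are not scored
```

Ranking the resulting LxL map and taking the top L, L/2, and L/5 entries per range class then yields the contact lists that are scored for precision.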

Evaluation of Contact Accuracy
According to CASP protocol, precision was calculated as follows:

Precision = TruePositives / (TruePositives + FalsePositives)

Dataset Preparation

Some domains were not successfully downloaded, and brief manual inspection showed that structure files did not seem to exist for at least several of the domains specified in the S35 sequence list. Domains for which a structure file could not be obtained for any reason were excluded from the dataset.
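The CASP precision metric defined above can be sketched as follows; the representation of contacts as (i, j) pairs is our choice.

```python
def precision(predicted_pairs, true_pairs):
    """CASP precision for a ranked contact list: the fraction of
    predicted residue pairs that are real contacts (< 8 Å in the
    experimental structure)."""
    true_positives = sum(1 for pair in predicted_pairs if pair in true_pairs)
    false_positives = len(predicted_pairs) - true_positives
    denom = true_positives + false_positives
    return true_positives / denom if denom else 0.0
```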
Amino acid sequences for each domain were derived from the corresponding structure file. Although each domain was originally extracted from a FASTA file including the sequences, discrepancies existed between those explicit sequences and the string of residues contained in each structure file (e.g. some structures contained only portion(s) of the sequences specified separately). Additionally, monitoring of the residue numbers present in the structure files enabled us to denote gaps in the sequences with a unique character, whereas no such notation existed in the original FASTA sequences.
For reasons further contextualized under Crops and Padding, we trimmed the beginning(s) and/or end(s) of protein sequences if gaps larger than or equal to 32 residues separated small terminal segments of fewer than 17 residues from the remainder of the protein. This eliminated instances in which possibly hundreds of residues whose identities and/or positions were undetermined in the protein structure (gaps) were included in the structure-derived sequence because several adjacent terminal residues were recorded (see CATH domain 1hu3A00 as an example). In such cases, the characters corresponding to these end residues and the adjacent gap section(s) were removed from the sequence, and the process was repeated for the remaining sequence until no additional changes were made. At the conclusion of these sequence modifications, corresponding structure files were trimmed to match the shorter sequences.
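The trimming rule above can be sketched as follows. The gap character and helper names are our own; the thresholds (gap runs of 32 or more, terminal segments of fewer than 17 residues) follow the text.

```python
def trim_terminal_segments(seq, gap="-", min_gap=32, max_segment=16):
    """Repeatedly remove a terminal segment of fewer than 17 residues
    that is separated from the rest of the sequence by a run of 32 or
    more gap characters, until no further change is made."""

    def scan(s):
        # lengths of the leading non-gap segment and the gap run after it
        seg = 0
        while seg < len(s) and s[seg] != gap:
            seg += 1
        run = seg
        while run < len(s) and s[run] == gap:
            run += 1
        return seg, run - seg

    changed = True
    while changed:
        changed = False
        seg, run = scan(seq)          # check the N-terminus
        if 0 < seg <= max_segment and run >= min_gap:
            seq, changed = seq[seg + run:], True
        seg, run = scan(seq[::-1])    # check the C-terminus (reversed)
        if 0 < seg <= max_segment and run >= min_gap:
            seq, changed = seq[:len(seq) - seg - run], True
    return seq
```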
PSSMs were constructed using PSI-BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) and the nonredundant (nr) database, while alignments were generated using HHblits from HHSuite (https://github.com/soedinglab/hhsuite) in conjunction with the uniclust30_2018_08 database (http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/). Both ran with E-values of 0.001 and completed 3 iterations. A limit of 100000 was imposed on the number of sequences HHblits could write out to the alignment file. HHblits alignment results were then processed further and used in computing the Potts models (the original code was obtained from https://github.com/magnusekeberg/plmDCA; our modifications to it are in the GitHub repository for this project under src/potts.patch). Further information about these programs, as well as details concerning how the values of interest were extracted from these three sets of output files and used in the input vectors, can be found in the code documentation.
Training labels for the distance prediction task, as well as the auxiliary secondary structure and torsion angle predictions, were created by making structural calculations (in the case of the auxiliary predictions, aided by the DSSP algorithm, code available at https://github.com/cmbi/xssp) and binning the observations, thus enabling ProSPr to treat each as a classification task. Pairwise distances between all available beta carbon atoms (alpha carbons in the case of glycine) were classified into 64 classes, where 62 represented equivalently-sized bins over the 2-22 Å range (~0.32 Å width each), one represented all distances greater than 22 Å, and another signified a gap or missing part of the protein structure.
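The distance binning can be sketched as follows. The bin ordering (0 = gap, 1-62 = the 2-22 Å bins, 63 = greater than 22 Å) is our assumption; the text gives only the class counts and ranges.

```python
def distance_bin(distance):
    """Map a Cβ-Cβ distance (in Å, or None for a gap in the structure)
    to one of the 64 classes described in the text: 62 equal-width bins
    over 2-22 Å (~0.32 Å each), one class for >22 Å, one for gaps."""
    d_min, d_max, n_bins = 2.0, 22.0, 62
    if distance is None:
        return 0                        # gap / missing structure
    if distance > d_max:
        return 63                       # all distances beyond 22 Å
    width = (d_max - d_min) / n_bins    # ~0.32 Å
    idx = int((distance - d_min) / width)
    return 1 + min(max(idx, 0), n_bins - 1)
```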
Secondary structure classifications as made by DSSP were retained as separate classes, with the addition of one extra bin to again represent missing data (e.g. no residue(s) in that part of the structure), for a total of 9 classes. Phi and psi torsion angle values were classified into 36 10° bins ranging from -180° to 180°, plus one for gaps, resulting in a total of 37 possible classes.
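The auxiliary-label binning can be sketched as follows. Placing the gap class last and the exact DSSP code ordering are our assumptions; the class counts (37 torsion classes, 9 SS classes) follow the text.

```python
def torsion_bin(angle):
    """Bin a phi or psi angle (degrees in [-180, 180), or None for a
    gap) into 37 classes: 36 bins of 10° plus a trailing gap class."""
    if angle is None:
        return 36
    return min(int((angle + 180.0) / 10.0), 35)

# DSSP's eight secondary structure codes plus one gap class give the
# nine SS classes; the ordering and use of '-' for loop are assumptions.
DSSP_CODES = "HBEGITS-"

def ss_bin(code):
    """Map a DSSP code to 0-7, or 8 for missing/unrecognized residues."""
    return DSSP_CODES.index(code) if code in DSSP_CODES else 8
```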
After all inputs and labels were generated for available domains, an intersection was performed to extract the list of domains for which each of these independent tasks had executed successfully; this list was subsequently divided into the training, validation, and test sets as described in Methods.
[Supplementary Figure S3: For a domain with missing residues from the CATH validation set, we observed that ProSPr predicts distances for the missing residues (yellow crosses in the label). This might be a useful feature to enhance structure reconstruction efforts and is subject to further analysis.]

Crops and Padding
Using the dataloader function included and documented in the project source code, training is performed on 64x64 amino acid crops of training domains, specified using the i and j (coordinate) options and the protein domain id. The i and j coordinates correspond to the amino acid positions on which the crop is centered, and can range from 0 to one less than the sequence length of that domain. This results in padding of up to 32 being added in either dimension so that the crop maintains its 64x64 size. Training on crops allows the network to use more training data than if the entire domains were used, and the consistent size helps in distributed training.
However, in testing and application of such a network, distance predictions for the entire protein domain are often of much greater relevance than those made for a single 64x64 residue crop. Therefore, outside of training (including for validation set testing, the analysis done with the CASP13 targets, etc.), full-domain predictions are made by processing multiple adjacent crops of the same domain and averaging those values where the crops overlap. The stride parameter specifies how far apart the i,j coordinates for each crop are (e.g. a stride of 1 means that every possible i,j combination for the length of the protein will be processed); however, the evaluation time increases as the square of the protein length. On the contacts dataset we observed only small improvements of 1% by reducing the stride from 25 to 1.
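The crop-averaging scheme above can be sketched as follows. The callable predict_crop is hypothetical (standing in for a forward pass of the network on one crop), and the treatment of crops extending past the domain edge is our simplification of the padding described earlier.

```python
import numpy as np

def full_domain_prediction(predict_crop, length, stride=25, crop=64):
    """Assemble an LxL prediction map from overlapping 64x64 crops,
    averaging values where crops overlap. predict_crop(i, j) returns a
    (crop, crop) array anchored at residues (i, j); cells that fall
    outside the domain are discarded before accumulation."""
    total = np.zeros((length, length))
    counts = np.zeros((length, length))
    for i in range(0, length, stride):
        for j in range(0, length, stride):
            pred = predict_crop(i, j)
            h, w = min(crop, length - i), min(crop, length - j)
            total[i:i + h, j:j + w] += pred[:h, :w]
            counts[i:i + h, j:j + w] += 1
    return total / np.maximum(counts, 1)
```

The number of crops grows as (L/stride)^2, which is why a stride of 25 was used for the larger CASP13 domains.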