## Abstract

Deep neural networks have recently enabled spectacular progress in predicting protein structures, as demonstrated by DeepMind’s winning entry with Alphafold at the latest Critical Assessment of Structure Prediction competition (CASP13). The best-performing protein structure prediction pipeline leverages inter-residue distance predictions to assemble a final protein model, but this distance prediction network has not been published. Here, we make a trained implementation of this network available to the broader scientific community. We also benchmark its predictive power on the related task of contact prediction against the CASP13 contact prediction winner, TripletRes. Access to ProSPr will enable other labs to build on best-in-class protein distance predictions and to engineer superior protein reconstruction methods.

## Introduction

Recently, a variety of powerful protein structure prediction methods based on machine learning algorithms have been reported.^{1} Although direct prediction of structure from sequence has been attempted,^{2} reproducible success is currently based on two-stage protocols.^{3} The first stage is the training of a deep convolutional neural network (CNN) that predicts macromolecular structure restraints such as residue-residue distances, residue contacts, dihedral angles, or secondary structure assignments.^{4} In the second stage, these restraints are used to construct a folded three-dimensional structure of the target protein. In the recent Critical Assessment of Structure Prediction (CASP13), a two-stage folding protocol developed by DeepMind outperformed all established academic groups and produced the highest-quality predictions for 25 of 43 protein structures.^{5} Unfortunately, DeepMind has not expressed a plan to publish the source code of their Alphafold protocol.

## Results & Discussion

Here, we report the re-implementation of the first part of the Alphafold pipeline, an intramolecular distance prediction CNN, made freely available as source code (https://github.com/dellacortelab/prospr) and a Docker^{6} container (see Methods). The CNN is in agreement with architectural details revealed by DeepMind at the December 2018 CASP13 conference (http://predictioncenter.org/casp13/doc/presentations/) and recently presented at a symposium at Washington University (https://www.youtube.com/watch?v=uQ1uVbrIv-Q); however, certain design decisions and hyperparameters were not shared in sufficient detail and required re-engineering. A graphical abstract of the CNN is given in Figure 1.

The CNN, named ProSPr (**Pro**tein **S**tructure **Pr**ediction), predicts the C_{β}-C_{β} distance distributions between all amino acid residues (C_{α} for Glycine) in a given protein sequence. We trained three versions of ProSPr on sequences in the CATH S35 dataset^{7} (Supplementary Note and Figure S1) with the same network architecture but different input vectors. ProSPr follows Alphafold exactly and uses as input features the sequence information, the results of multiple sequence alignments (MSA) computed with PSI-BLAST^{8} and HHblits,^{9} as well as a Potts model^{10,11,12} calculated from the MSA. ProSPr2 omits the Potts model, and ProSPr3 only uses the sequence information as input.

The performance of these three models was tested on the CASP13 dataset for free-modeling and template-based targets. The predicted distance distributions were converted into contact probabilities (distance between residues < 8 Å), and precision scores for three different classes of contacts were calculated according to the CASP assessment protocol.^{13} ProSPr precision scores were directly compared to the performance of the CASP13-winning CNN TripletRes^{14} and are shown in Figure 2 (Supplementary Table S1). Without being explicitly trained for this purpose, ProSPr predicts contacts for the 109 tested CASP13 domains with precision comparable to TripletRes over all classes, as shown in Table 1. Table 1 reports precision scores both for ProSPr contacts whose distance distribution maximum is < 8 Å and for the full set of contacts independent of distribution maxima. For high-confidence predictions (maximum < 8 Å), ProSPr is on average 2% better than TripletRes on the L/5 scores. The L/2 and L scores are not directly comparable, because the absolute number of contacts ranked for ProSPr is substantially lower when the maximum < 8 Å criterion is applied than the total number of possible contacts ranked with TripletRes. For the precision comparison, the ranked probabilities of all contacts, independent of maximum, are therefore also reported. Under these conditions, ProSPr is comparable to TripletRes, though on average slightly inferior. ProSPr2 results are comparable to ProSPr for short- and medium-range contact predictions but inferior for long-range contact predictions. ProSPr3 is inferior to ProSPr in all categories. The performance of ProSPr2/3 relative to TripletRes is shown in Supplementary Figure S2. One issue with current precision reporting is that a smaller number of high-confidence predictions leads to an inflation of L and L/2 scores, making model comparisons based on the precision metric alone difficult to interpret.
However, L/5 scores accurately measure the ability of a network to assign high-confidence contacts, and here ProSPr outperforms TripletRes by an average of 2%, in agreement with reports given by the Alphafold authors (https://www.youtube.com/watch?v=uQ1uVbrIv-Q). Because ProSPr is trained to predict distances, the comparison against TripletRes only serves as a proof of concept. It would be a simple task to change the ProSPr network’s final layers and train it explicitly for contact prediction, though that was beyond the scope of this work.

In addition to the Python-based source code, a Docker^{6} container of ProSPr is made available to enable rapid usage of the distance prediction protocol. The container includes input vectors for select CASP13 targets, three pretrained ProSPr models, and the distance prediction function needed to reproduce the results reported here. The distribution also includes all dependencies necessary to produce a distance prediction for arbitrary sequences. Furthermore, the training set based on the CATH database, including the MSAs and Potts models, is made available (https://byu.box.com/v/ProteinStructurePrediction) so that the training outlined in the Methods section can be repeated (approximately 2 TB of data). The GitHub repository contains a training function that can be used either to improve a pretrained model or to train a modified ProSPr model for further optimization or ablation testing (full training on the CATH dataset takes ~4 weeks on a single T100 GPU). The original Alphafold protocol ensembled distance predictions over four separately trained models and subtracted a reference network during CASP13. A pretrained reference network is also provided that predicts distances only from sequence length and whether each residue is glycine (Supplementary Note). With time, we will make additional converged models of ProSPr and more comprehensive Docker containers available to enable model ensembling.

The field of protein structure prediction still has to tackle the challenge of protein reconstruction from geometric distance restraint distributions. During CASP13 it became apparent that converting good distance predictions into chemically sound structures is still an unsolved problem.^{4} ProSPr lowers the entrance barrier for academic labs and enables the community to quickly build on top of the internal coordinate predictions to develop improved protein reconstruction protocols. Further, we anticipate applications of ProSPr in investigating the validity of evolutionary constraints apparent from MSAs, as ProSPr makes it possible to rapidly compare the effects of many single mutations on protein distances. These insights might also enable improved algorithms for in-silico drug discovery for mutated targets. In addition, we observed that ProSPr can interpolate distances across missing residues (Supplementary Figure S3), rendering it a possible tool to support protein reconstruction from low-resolution X-ray or cryo-EM data.^{15}

In conclusion, we have demonstrated that ProSPr, a CNN based on the scarce details available for Alphafold, predicts residue-residue contacts with accuracy comparable to the CASP13 winner TripletRes. ProSPr has the potential to propel protein structure prediction forward by democratizing access to the deep neural network and to empower directed evolution and protein reconstruction efforts.

## Methods

### Overview of ProSPr Architecture

Distance predictions within ProSPr are made by calling the distance prediction function, which consists of three steps as shown in Figure M1. First, an (L+32)×(L+32) profile is constructed for a sequence of length L using PSI-BLAST, HHblits, and a Potts model, adding a frame of 32 bins as padding (Supplementary Note). Second, for a set of 64×64 crops of the profile, defined by a stride parameter, an input vector with dimensions 675×64×64 is assembled. The input vector encodes the raw parameters, score, h parameters, and Frobenius norm derived from the Potts model (530 layers in total). Further, it contains two layers that hold the residue indices of the crop, 42 layers for one-hot encoding of the sequence, 40 layers for a position-specific scoring matrix (PSSM), 60 layers for the HHblits profile, and one layer for the sequence length. Third, the input vector is propagated through the CNN. After an initial batch normalization, 1×1 convolution filters reshape the vector into a 128×64×64 matrix. This matrix is iterated 220 times through a residual network (ResNet) block that performs batch normalization, applies the exponential linear unit (ELU) activation function, projects down to 64×64×64 dimensions, applies batch normalization and ELU again, and then cycles through four different dilation filters. The dilation filters have dilations of 1, 2, 4, and 8 and are applied with padding of the same size to retain dimensionality. After a final batch normalization, the matrix is projected back up to 128×64×64 and an identity addition is performed. After 220 iterations, the final matrix is subjected to two 1×1 convolutions that reshape it into the final distance and auxiliary predictions. The auxiliaries predict eight classes of secondary structure as defined by the DSSP classification, and the phi and psi dihedrals for each residue; the angles are binned with 10° resolution between 0 and 360°.
Due to possible gaps in the sequence, an additional classification bin is introduced for each auxiliary prediction that represents unassignable information. The auxiliary predictions were only used for training but could yield additional insights in ProSPr applications.
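The residual block described above can be sketched in PyTorch. This is a minimal sketch, not the ProSPr source: the layer ordering follows the text, but the 3×3 kernel size of the dilated convolution and all class and variable names are our assumptions.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """One ResNet block as described in the text: batch norm -> ELU ->
    1x1 projection down to 64 channels -> batch norm -> ELU -> dilated
    convolution -> final batch norm -> 1x1 projection up to 128
    channels -> identity addition. The 3x3 kernel is an assumption."""
    def __init__(self, channels=128, bottleneck=64, dilation=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        # padding equal to the dilation keeps the 64x64 spatial size
        self.dil = nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                             padding=dilation, dilation=dilation)
        self.bn3 = nn.BatchNorm2d(bottleneck)
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        self.elu = nn.ELU()

    def forward(self, x):
        h = self.down(self.elu(self.bn1(x)))
        h = self.dil(self.elu(self.bn2(h)))
        h = self.up(self.bn3(h))
        return x + h  # identity addition

# 220 iterations, cycling through dilations 1, 2, 4, 8
blocks = nn.Sequential(*[DilatedResBlock(dilation=[1, 2, 4, 8][i % 4])
                         for i in range(220)])
```

Because padding always matches the dilation, the spatial dimensions of the crop are preserved through every block, so the identity addition is well defined.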

### Training of ProSPr

ProSPr was trained on 64×64 crops extracted from the CATH S35 dataset^{7} with 26393, 1000, and 500 domains randomly selected as training, validation, and test sets, respectively (Supplementary Note). Initial weights were assigned randomly with PyTorch, the loss was calculated using cross entropy, and an Adam optimizer with a learning rate of 0.001 was used to update the weights. The total loss was calculated as the weighted sum of ten times the distance loss, the losses of the two secondary structure assignments, and the losses of the four torsion angle assignments. Training loss and validation loss converged after 500,000 iterations with a training batch size of 8 (Supplementary Figure S1), which corresponds approximately to the number of crops necessary to visit each subdomain in the training set once. The training of ProSPr2 and ProSPr3 used the same setup; only the input vectors contained different amounts of information. For ProSPr2, all layers that contained Potts information were set to zero. For ProSPr3, the PSSM and HHblits layers were also set to zero. For these networks, the training loss did not converge within 500,000 iterations (Supplementary Figure S1).
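The weighted total loss can be sketched as follows. The head names (`dist`, `ss_i`, etc.) are our own convention for the distance head, the two secondary-structure heads, and the four torsion heads, not names taken from the ProSPr source.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def total_loss(pred, target):
    """Weighted sum described in the text: ten times the distance loss
    plus the losses of the two secondary-structure and four
    torsion-angle auxiliary heads. `pred` holds logits, `target`
    holds class indices; the dict keys are hypothetical."""
    loss = 10.0 * ce(pred["dist"], target["dist"])
    for head in ("ss_i", "ss_j", "phi_i", "phi_j", "psi_i", "psi_j"):
        loss = loss + ce(pred[head], target[head])
    return loss

# Per the text, an Adam optimizer with learning rate 0.001 updates the weights:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```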

### Converting distances into contacts

As a test, the distances for 109 CASP13 domains, which were not included in the training or validation sets, were predicted and converted into contacts. Instead of using all possible 64×64 crops, a stride of 25 was used between crops to speed up the evaluation of large domains. Average contact scores improved by 1% when a stride of 1 was used for the 44 shortest domains. The 64×64×64 distance output encodes the probability that residues *i* and *j* have a distance that is either not assignable (e.g., a gap in the sequence), in the range of 2.3–22 Å with 0.3 Å resolution between classes, or greater than 22 Å. If the maximum of the probability distribution fell between 2.3 and 8 Å (bins 1–19), the two residues were considered in contact for the high-confidence predictions. In all cases, contacts were ranked according to the summed probability of distances between 2.3 and 8 Å, and the top L, L/2, and L/5 contacts (L is the length of the sequence) were selected to calculate precision scores. The contacts were classified based on the sequence separation of residues *i* and *j* into short-range (6 ≤ |*i* − *j*| ≤ 11), medium-range (12 ≤ |*i* − *j*| ≤ 23), and long-range (|*i* − *j*| ≥ 24) contacts.
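The ranking step above can be sketched in NumPy: sum the probability mass of bins 1–19 (2.3–8 Å) per residue pair and return the top-ranked pairs at a given sequence separation. The function name and argument layout are illustrative, not from the ProSPr source.

```python
import numpy as np

def top_contacts(dist_probs, top_k, min_sep=24):
    """Rank residue pairs by the summed probability that their distance
    falls in bins 1-19 (2.3-8 A; bin 0 is 'unassignable'), as described
    in the text, and return the top_k pairs with |i - j| >= min_sep
    (default 24, i.e. long-range contacts).
    dist_probs: array of shape (64, L, L)."""
    contact_prob = dist_probs[1:20].sum(axis=0)   # P(2.3 A <= d < 8 A)
    L = contact_prob.shape[0]
    i_idx, j_idx = np.triu_indices(L, k=min_sep)  # pairs with j - i >= min_sep
    order = np.argsort(contact_prob[i_idx, j_idx])[::-1][:top_k]
    return [(int(i_idx[o]), int(j_idx[o])) for o in order]
```

For the L/5 score of a domain of length L, `top_k` would be L // 5; short- and medium-range scores follow by changing the separation window.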

### Evaluation of contact accuracy

According to the CASP protocol, precision was calculated as follows:
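For reference, the standard CASP precision metric, which we assume is the formula intended here, is

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

where $TP$ is the number of correctly predicted contacts and $FP$ the number of incorrectly predicted contacts among the top $L$, $L/2$, or $L/5$ ranked pairs.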

The average in each category was calculated over 109 test domains from CASP13. For the comparison with TripletRes, the difference in average precision per category was again averaged.
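A minimal sketch of this computation, assuming contacts are represented as (i, j) index pairs (the function name is ours):

```python
def contact_precision(predicted, true_contacts):
    """Precision of a ranked list of predicted contacts: the fraction
    of predicted pairs present in the set of true contacts,
    i.e. TP / (TP + FP), following the CASP convention cited above."""
    if not predicted:
        return 0.0
    tp = sum(1 for pair in predicted if pair in true_contacts)
    return tp / len(predicted)
```

Averaging this value per category over the 109 test domains yields the scores compared in Figure 2.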

### Installation instructions for Docker

To install ProSPr as a Docker container and see all currently available options, enter the following in the command line (after installing Docker):
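A hypothetical invocation is shown below; the image name `prospr/prospr` is our assumption, so consult the GitHub repository (https://github.com/dellacortelab/prospr) for the published container location.

```shell
# Pull the ProSPr container and display its available options.
# "prospr/prospr" is a hypothetical image name, not confirmed by the text.
docker pull prospr/prospr
docker run --rm prospr/prospr --help
```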

Yes, it is that easy!

## Acknowledgments

DDC expresses gratitude for computational resources offered by BYU Office of Research Computing.