## Abstract

**Motivation** Protein contacts contain key information for understanding protein structure and function, and thus contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and of limited use for de novo structure prediction.

**Method** This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) information and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformations of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformations of pairwise information, including the output of the first residual network, EC information and pairwise potential. This neural network allows us to model the very complex relationship between sequence and contact map as well as the long-range interdependency between contacts and thus obtain high-quality contact prediction.

**Results** Our method greatly outperforms existing contact prediction methods and leads to much more accurate contact-assisted protein folding. For example, on the 105 CASP11 test proteins, the L/10 long-range accuracy obtained by our method is 83.3% while that by the state-of-the-art methods CCMpred and MetaPSICOV (the CASP11 winner) is 43.4% and 60.2%, respectively. On the 398 membrane proteins, the L/10 long-range accuracy obtained by our method is 77.3% while that by CCMpred and MetaPSICOV is 51.8% and 61.2%, respectively. Ab initio folding using our predicted contacts as restraints can yield correct folds (i.e., TMscore>0.6) for 224 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. Further, our contact-assisted models have much better quality (especially for membrane proteins) than template-based models. Using our predicted contacts as restraints, we can fold 240 of the 398 membrane proteins with TMscore>0.5, whereas homology modeling, using the training proteins of our method as templates, can do so for only 10 of them.

## Introduction

De novo protein structure prediction from sequence alone is one of the most challenging problems in computational biology. Recent progress has indicated that some correctly-predicted long-range contacts may allow accurate topology-level structure modeling [1] and that direct evolutionary coupling (EC) analysis of multiple sequence alignment (MSA) [2] may reveal some long-range native contacts for proteins with a large number of sequence homologs [3]. Therefore, contact prediction and contact-assisted protein folding have recently gained much attention in the community. However, for many proteins, especially those without many sequence homologs, the contacts predicted by state-of-the-art predictors such as CCMpred [4], PSICOV [5], Evfold [6], MetaPSICOV [7] and CoinDCA [8] are still of low quality and insufficient for accurate contact-assisted protein folding [9]. This motivates us to develop a better contact prediction method, especially for proteins without a large number of sequence homologs. In this paper we say two residues form a contact if they are spatially proximal in the native structure, i.e., the Euclidean distance between their C_{β} atoms is less than 8Å [10].

Existing contact prediction methods roughly belong to two categories: (i) unsupervised evolutionary coupling (EC) analysis that predicts contacts by identifying co-evolved residues in an MSA, such as Evfold [6], PSICOV [5], CCMpred [4], Gremlin [11], and others [12-14]; and (ii) supervised machine learning methods that predict contacts from a variety of evolutionary and co-evolutionary information, e.g., SVMSEQ [15], CMAPpro [10], PconsC2 [16], MetaPSICOV [7], PhyCMAP [17] and CoinDCA-NN [3]. Among the supervised methods, PconsC2 uses a 5-layer supervised learning architecture [16], while CoinDCA-NN and MetaPSICOV employ a 2-layer neural network [7]. CMAPpro uses a neural network with many more layers, but its performance is reported to saturate at about 10 layers. Evolutionary coupling (EC) analysis needs a large number of sequence homologs to be effective [3, 16]. Some supervised methods such as MetaPSICOV and CoinDCA-NN outperform unsupervised EC analysis on proteins without many sequence homologs, but their performance is still limited by their shallow architectures.

To further improve supervised learning methods for contact prediction, we borrow ideas from a very recent breakthrough in computer vision. We have greatly improved contact prediction by developing a brand-new deep learning model called a residual neural network [18] for contact prediction. Deep learning is a powerful machine learning technique that has revolutionized image classification [19, 20] and speech recognition [21]. In 2015, ultra-deep residual neural networks [22] demonstrated state-of-the-art performance in several computer vision challenges (similar to CASP) such as image classification [23] and object recognition [24]. If we treat a protein contact map as an image, then protein contact prediction is similar to (but not exactly the same as) pixel-level image labeling, so some techniques effective for image labeling may also work for contact prediction. However, it is not straightforward to apply image labeling techniques to contact prediction due to the following differences between the two problems. First, in the computer vision community image-level labeling (i.e., classification of a single image) has been extensively studied, but there are far fewer studies on pixel-level image labeling (i.e., classification of each individual pixel). Second, in many image classification scenarios images are resized to a fixed size, but we cannot resize a contact map since we need a prediction for every residue pair (equivalent to an image pixel). Third, contact prediction has much more complex input features (including both sequential and pairwise features) than image labeling. Fourth, the ratio of contacts in a protein is very small (<10%); that is, the numbers of positive and negative labels in contact prediction are extremely unbalanced.

In this paper we present a very deep residual neural network for contact prediction. Such a network can capture the very complex sequence-contact relationship and the long-range interdependency between contacts of a protein. We train this deep neural network using a subset of proteins with solved structures and then test its performance on public data, including the CASP [25, 26] and CAMEO [27] test proteins as well as membrane proteins. Our experimental results show that our method obtains much better prediction accuracy than existing methods and also results in much more accurate contact-assisted 3D structure modeling. The deep learning method described in this manuscript will also be useful for the prediction of protein-protein and protein-RNA interfacial contacts.

## Results

### Deep learning model for contact prediction

Figure 1 illustrates our deep neural network model for contact prediction [28]. In contrast to previous supervised learning approaches for contact prediction, which employ only a small number of hidden layers (i.e., a shallow architecture), our deep neural network [22] employs dozens of hidden layers. By using a very deep architecture, our model can automatically learn the complex relationship between sequence information and contacts and also implicitly model the interdependency among contacts, thus improving contact prediction [16]. Our model consists of two major modules, each being a residual neural network. The first module conducts a series of 1-dimensional (1D) convolutional transformations of sequential features (sequence profile, predicted secondary structure and solvent accessibility). The output of this 1D convolutional network is converted to a 2-dimensional (2D) matrix by an operation similar to an outer product and fed into the 2^{nd} module together with pairwise features (i.e., co-evolution information, pairwise contact and distance potential). The 2^{nd} module is a 2D residual network that conducts a series of 2D convolutional transformations of its input. Finally, the output of the 2D convolutional network is fed into a logistic regression, which predicts the probability that any two residues form a contact. In addition, each convolutional layer is preceded by a simple nonlinear transformation called the rectified linear unit [29]. The output of each 1D convolutional layer has dimension *L×m*, where *L* is protein sequence length and *m* is the number of hidden neurons at one residue. The output of a 2D convolutional layer has dimension *L×L×n*, where *n* is the number of hidden neurons for one residue pair. The number of hidden neurons may vary at each layer.
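
As a schematic illustration of this two-module design, the sketch below shows a forward pass in PyTorch (our implementation actually uses Theano). The plain convolution stacks here stand in for the residual networks (the residual blocks themselves are detailed in Method), the layer widths *m* = *n* = 60 follow the Method section, and all names and layer counts are illustrative assumptions rather than our actual code:

```python
import torch
import torch.nn as nn

class ContactNet(nn.Module):
    def __init__(self, n_seq_feat, n_pair_feat, m=60, n=60):
        super().__init__()
        # Stand-in for the 1D residual network (window size 17, see Method)
        self.res1d = nn.Sequential(
            nn.Conv1d(n_seq_feat, m, 17, padding=8), nn.ReLU(),
            nn.Conv1d(m, m, 17, padding=8), nn.ReLU())
        # Stand-in for the 2D residual network (window size 3x3)
        self.res2d = nn.Sequential(
            nn.Conv2d(3 * m + n_pair_feat, n, 3, padding=1), nn.ReLU(),
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(n, 1, 1)   # per-pair logistic regression

    def forward(self, seq_feat, pair_feat):
        # seq_feat: (1, n_seq_feat, L); pair_feat: (1, n_pair_feat, L, L)
        v = self.res1d(seq_feat).squeeze(0).t()        # (L, m)
        L = v.shape[0]
        vi = v.unsqueeze(1).expand(L, L, -1)           # v_i broadcast over j
        vj = v.unsqueeze(0).expand(L, L, -1)           # v_j broadcast over i
        idx = (torch.arange(L).unsqueeze(1) + torch.arange(L)) // 2
        vm = v[idx]                                    # v_{(i+j)//2}
        pair = torch.cat([vi, vm, vj], dim=-1)         # outer-product-like step
        pair = pair.permute(2, 0, 1).unsqueeze(0)      # (1, 3m, L, L)
        x = torch.cat([pair, pair_feat], dim=1)
        return torch.sigmoid(self.out(self.res2d(x)))  # (1, 1, L, L) contact prob
```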

We tested our method using the 150 Pfam families described in [5], the 105 CASP11 test proteins [30], 398 membrane proteins (Supplementary Table 1) and 76 hard CAMEO test proteins released from 10/17/2015 to 04/09/2016 (Supplementary Table 2). We compare our method with some state-of-the-art methods including PSICOV [5], Evfold [6], CCMpred [4], and MetaPSICOV [7]. The former three predict contacts using direct evolutionary coupling analysis. CCMpred performs slightly better than PSICOV and Evfold. MetaPSICOV [7] is a supervised learning method and performed the best in CASP11 [30]. All the programs are run with parameters set according to their respective papers. We cannot evaluate PconsC2 [16] since we failed to obtain any results from its web server. PconsC2 did not outperform MetaPSICOV in CASP11 [30], so it may suffice to just compare our method with MetaPSICOV.

### Overall Performance

We evaluate the accuracy of the top *L/k* (*k*=10, 5, 2, 1) predicted contacts where L is protein sequence length [3]. The prediction accuracy is defined as the percentage of native contacts among the top *L/k* predicted contacts. We also divide contacts into three categories according to the sequence distance of two residues in a contact. That is, a contact is short-, medium- and long-range when the sequence distance falls into [6, 11], [12, 23], and ≥24, respectively.
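
For concreteness, the following is a minimal Python sketch of this metric, assuming `pred` is an (L, L) matrix of predicted contact probabilities and `native` the corresponding 0/1 contact map (both names are illustrative):

```python
import numpy as np

def top_lk_accuracy(pred, native, k, min_sep=24):
    """Accuracy of the top L/k predictions among residue pairs whose sequence
    separation is at least min_sep (24 corresponds to long-range contacts)."""
    L = pred.shape[0]
    scored = [(pred[i, j], i, j)
              for i in range(L) for j in range(i + min_sep, L)]
    scored.sort(reverse=True)               # highest-probability pairs first
    top = scored[: L // k]
    return sum(native[i, j] for _, i, j in top) / len(top)
```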

As shown in Tables 1-4, our method outperforms CCMpred and MetaPSICOV by a very large margin on the 4 test sets regardless of how many top predicted contacts are evaluated and no matter whether the contacts are short-, medium- or long-range. The advantage of our method is the smallest on the 150 Pfam families because many of them have a pretty large number of sequence homologs. In terms of top L long-range contact accuracy, our method exceeds CCMpred and MetaPSICOV by 0.33 and 0.21, respectively, on the CASP11 set. On the CAMEO set, our method exceeds CCMpred and MetaPSICOV by 0.27 and 0.17, respectively. On the membrane protein set, our method exceeds CCMpred and MetaPSICOV by 0.28 and 0.19, respectively. Since the Pfam test set is relatively easy, in the following sections we will focus on the CASP11, CAMEO and membrane protein test sets.

### Accuracy with respect to the number of sequence homologs

To examine the performance of our method with respect to the amount of homologous information available for a protein under prediction, we measure the effective number of sequence homologs in MSA by *Meff* [17] (see Method for its formula). A protein with a smaller *Meff* has fewer non-redundant sequence homologs. We divide all the test proteins into 10 bins according to *ln(Meff)* and then calculate the average accuracy of the test proteins in each bin. We merge the first 3 bins for the membrane protein set since they contain a small number of proteins.

Fig. 2 shows the top L/5 contact prediction accuracy with respect to *ln(Meff)*. Roughly speaking, the prediction accuracy increases with respect to *Meff*, i.e., the amount of homologous information. Our method outperforms both MetaPSICOV and CCMpred no matter how much homologous information is available for the protein under prediction. Our method has an even bigger advantage when *ln(Meff)≤7* (equivalently *Meff<1100*). That is, our method works much better when the protein under prediction does not have a large number of non-redundant sequence homologs. Fig. 2 also shows that no matter how many sequence homologs are available, two supervised learning methods (MetaPSICOV and our method) greatly outperform the unsupervised EC analysis method CCMpred.

### Contact-assisted protein folding

One of the important goals of contact prediction is to perform contact-assisted protein folding [9]. To test if our contact prediction can lead to better 3D structure modeling than the other methods, we build structure models for all the test proteins using the top predicted contacts by our method, CCMpred, and MetaPSICOV, respectively. For each test protein, we feed the top predicted contacts as restraints into the CNS suite [33] to generate 3D models. We measure the quality of a 3D model by TMscore [34], which ranges from 0 to 1, with 0 indicating the worst and 1 the best.

As shown in Fig. 3, our predicted contacts can generate much better 3D models than CCMpred and MetaPSICOV. On average, the 3D models generated by our method are better than MetaPSICOV and CCMpred by ∼0.12 TMscore unit and ∼0.15 unit, respectively. The average TMscore of the top 1 models generated by CCMpred, MetaPSICOV, and our method is 0.30, 0.35, and 0.47, respectively on the CASP/CAMEO dataset. On the membrane protein set, the average TMscore of the top 1 models generated by CCMpred, MetaPSICOV and our method is 0.37, 0.39, and 0.52, respectively. On the CASP/CAMEO dataset, the average TMscore of the best of top 5 models generated by CCMpred, MetaPSICOV, and our method is 0.32, 0.37, and 0.49, respectively. On the membrane protein set, the average TMscore of the best of top 5 models generated by CCMpred, MetaPSICOV, and our method is 0.40, 0.42, and 0.55, respectively. In particular, when the best of top 5 models are considered, our method can result in correct folds (i.e., TMscore>0.6) for 224 of the 579 test proteins, while MetaPSICOV and CCMpred can lead to correct folds for only 79 and 62 proteins, respectively.

### Contact-assisted models vs. template-based models

To compare our contact-assisted models with template-based models (TBMs), we build TBMs for each test protein by searching for the best templates among the 6767 training proteins of our deep learning model and running MODELLER on each of the top 5 templates (see Method for details). Fig. 4 shows the head-to-head comparison between our contact-assisted models and the TBMs on the three test sets. In summary, when only the first models are evaluated, our contact-assisted models for the 76 CAMEO test proteins have an average TMscore 0.410 while the TBMs have an average TMscore 0.317. On the 105 CASP11 test proteins, the average TMscore of our contact-assisted models is 0.516 while that of the TBMs is only 0.393. On the 398 membrane proteins, the average TMscore of our contact-assisted models is 0.524 while that of the TBMs is only 0.149. When the best of top 5 models are evaluated, on the 76 CAMEO test proteins the average TMscore of our contact-assisted models is 0.427 while that of the TBMs is only 0.366. On the 105 CASP11 test proteins, the average TMscore of our contact-assisted models is 0.539 while that of the TBMs is only 0.441. On the 398 membrane proteins, the average TMscore of our contact-assisted models is 0.545 while that of the TBMs is only 0.187. These results indicate that when a query protein has no close templates, our contact-assisted models may have much better quality than TBMs. They also imply that our deep learning model does not predict contacts by simply copying contacts from the training proteins, and that contact-assisted modeling should be very useful for membrane proteins, since many of them have no close templates in the PDB.

Further, our contact-assisted models have TMscore>0.5 for 23 of the 76 CAMEO test proteins, while the TBMs have TMscore>0.5 for only 18 of them. Our contact-assisted models have TMscore>0.5 for 62 of the 105 CASP11 test proteins, while the TBMs do so for only 44 of them. Our contact-assisted models have TMscore>0.5 for 240 of the 398 membrane proteins, while the TBMs do so for only 10 of them. Our contact-assisted models of membrane proteins are much better than their TBMs because very few of the 6767 training proteins are good templates for the 398 test membrane proteins. When the 219 test proteins with ≤500 non-redundant sequence homologs are evaluated, the average TMscore of the TBMs is 0.254 while that of our contact-assisted models is 0.43. Among these 219 proteins, our contact-assisted models have TMscore>0.5 for 73 of them, while the TBMs do so for only 17.

### Specific examples

Here we show the predicted contacts and contact-assisted models of two specific proteins: Sin3a (PDB id: 2n2hB) and GP1 (PDB id: 4zjfA). Sin3a is a mainly-alpha protein consisting of two long, paired amphipathic helices [35]. The contact map predicted by our method has L/2 long-range accuracy 0.78 while that by MetaPSICOV has L/2 long-range accuracy 0.35. As shown in the lower-right triangle of Figure 5(A), MetaPSICOV fails to predict the contacts between the paired amphipathic helices. As shown in Figure 5(C), the contact-assisted model built from MetaPSICOV-predicted contacts has TMscore only 0.359. By contrast, the model built from our predicted contacts has TMscore 0.591.

GP1 is the receptor binding domain of Lassa virus. It has a central β-sheet (with its 5 beta strands ordered 1, 2, 7, 4, 3) sandwiched by the N and C termini on one side and an array of α-helices and loops on the other [36]. The key to forming this fold lies in the placement of beta7 between beta3 and beta4, reflected in the contact map around residue pairs (150, 40) and (150, 100). As shown in the upper-right triangle of Figure 5(B), our method successfully predicts these contacts and has L/2 long-range contact accuracy 0.72. The 3D model built from our predicted contacts has TMscore 0.491, as shown in the right picture of Figure 5(D). By contrast, MetaPSICOV predicts few contacts in these regions and its L/2 long-range accuracy is only 0.32. The 3D model built from the MetaPSICOV-predicted contacts has TMscore only 0.246, as shown in the left of Figure 5(D).

## Conclusion and Discussion

In this paper we have presented a new deep (supervised) learning method for protein contact prediction. Our method distinguishes itself from previous supervised learning methods in that it employs two deep residual neural networks to model the sequence-contact relationship: one for modeling of sequential features (i.e., sequence profile, predicted secondary structure and solvent accessibility) and the other for modeling of pairwise features (e.g., coevolution information). Ultra-deep residual networks are the latest breakthrough in computer vision and demonstrated the best performance in the computer vision challenge tasks (similar to CASP) in 2015. Our method is also unique in that we model a contact map as a single image and then conduct pixel-level labeling on the whole image. This allows us to take into consideration the correlation among multiple sequentially-distant residue pairs. By contrast, existing supervised learning methods predict whether two residues form a contact independently of the other residue pairs. Our experimental results show that our method dramatically improves contact prediction, exceeding the current best methods (e.g., CCMpred, Evfold, PSICOV and MetaPSICOV) by a very large margin. Ab initio folding using our predicted contacts as restraints can also yield much better 3D structural models than the other contact prediction methods. Further, our experimental results also show that our contact-assisted models are much better than template-based models built from the training proteins of our contact prediction model. We expect that our contact prediction method can help reveal many more biological insights for those protein families without any solved structures or close structural homologs.

In our current implementation, we found that our model achieves pretty good performance when using around 60-70 convolutional layers. A natural question is whether we can further improve prediction accuracy by using many more convolutional layers. In computer vision, it has been shown that a 1001-layer residual neural network can yield better accuracy for image-level classification than a 100-layer network (but no result has been reported for pixel-level image labeling). Currently we cannot apply more than 100 layers to our model due to the limited memory (12 GB) of a single GPU card. We plan to circumvent this memory limitation by extending our training algorithm to run on multiple GPU cards. Then we will train a model with hundreds of layers to see if prediction accuracy can be further improved.

## Method

### Deep learning model details

#### Residual network blocks

Our network consists of two residual neural networks, each in turn consisting of some residual blocks concatenated together. Fig. 6 shows an example of a residual block consisting of 2 convolution layers and 2 activation layers. In this figure, X_{l} and X_{l+1} are the input and output of the block, respectively. The activation layer conducts a simple nonlinear transformation of its input without using any parameters. Here we use the ReLU activation function [29] for such a transformation. Let f(X_{l}) denote the result of X_{l} going through the two activation layers and the two convolution layers. Then X_{l+1} is equal to X_{l} + f(X_{l}). That is, X_{l+1} is a combination of X_{l} and its nonlinear transformation. Since f(X_{l}) is equal to the difference between X_{l+1} and X_{l}, f is called the residual function and this network a residual network. In the first residual network, X_{l} and X_{l+1} represent sequential features and have dimension L×n_{l} and L×n_{l+1}, respectively, where L is protein sequence length and n_{l} (n_{l+1}) can be interpreted as the number of features or hidden neurons at each position (i.e., residue). In the 2^{nd} residual network, X_{l} and X_{l+1} represent pairwise features and have dimension L×L×n_{l} and L×L×n_{l+1}, respectively, where n_{l} (n_{l+1}) can be interpreted as the number of features or hidden neurons at one position (i.e., residue pair). Typically, we enforce n_{l} ≤ n_{l+1} since one position at a higher level is supposed to carry more information. When n_{l} < n_{l+1}, in calculating X_{l} + f(X_{l}) we pad zeros to X_{l} so that it has the same dimension as X_{l+1}. To speed up training, we also add a batch normalization layer [38] before each activation layer, which normalizes its input to have mean 0 and standard deviation 1. The filter size (i.e., window size) used by a 1D convolution layer is 17, while that used by a 2D convolution layer is 3×3 or 5×5. By stacking many residual blocks together, even if at each convolution layer we use a small window size, our network can model very long-range interdependency between input features and contacts, as well as the long-range interdependency between two different residue pairs. We fix the depth (i.e., the number of convolution layers) of the 1D residual network to 6, but vary the depth of the 2D residual network. Our experimental results show that with ∼60 hidden neurons at each position and ∼60 convolution layers for the 2^{nd} residual network, our model can yield pretty good performance. Note that it has been shown that for image classification a convolutional neural network with a smaller window size but many more layers usually outperforms a network with a larger window size but fewer layers. Further, a 2D convolutional neural network with a smaller window size also has fewer parameters than a network with a larger window size.
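
As a concrete illustration of one such block, here is a minimal pre-activation 2D residual block in PyTorch (an illustrative sketch under the scheme described above, not our Theano code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock2D(nn.Module):
    """One pre-activation residual block: batch norm -> ReLU -> convolution,
    applied twice, plus the identity shortcut X_{l+1} = X_l + f(X_l)."""
    def __init__(self, n_in, n_out, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2                 # keep the L x L spatial size
        self.bn1 = nn.BatchNorm2d(n_in)
        self.conv1 = nn.Conv2d(n_in, n_out, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm2d(n_out)
        self.conv2 = nn.Conv2d(n_out, n_out, kernel_size, padding=pad)
        self.n_pad = n_out - n_in              # extra channels when n_in < n_out

    def forward(self, x):                      # x: (batch, n_in, L, L)
        f = self.conv1(F.relu(self.bn1(x)))
        f = self.conv2(F.relu(self.bn2(f)))
        if self.n_pad > 0:                     # zero-pad X_l to match f's channels
            x = F.pad(x, (0, 0, 0, 0, 0, self.n_pad))
        return x + f
```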

Our deep learning method for contact prediction is unique in at least two aspects. First, our model employs two multi-layer residual neural networks, which have not been applied to contact prediction before. Residual neural networks can pass both linear and nonlinear information from end to end (i.e., from the initial input to the final output). Second, we do contact prediction on the whole contact map by treating it as an individual image. In contrast, previous supervised learning methods separate the prediction of one residue pair from the others. By doing contact prediction simultaneously for all the residue pairs of one protein sequence, we can easily model the long-range interdependency between two residue pairs and the long-range relationship between one contact and input features.

#### Conversion of sequential features to pairwise features

We convert the output of the first module of our model (i.e., the 1D residual neural network) to a 2D representation using an operation similar to an outer product. Briefly, let v={v_{1}, v_{2}, …, v_{i}, …, v_{L}} be the final output of the first module, where L is protein sequence length and v_{i} is a feature vector storing the output information for residue *i*. For a pair of residues *i* and *j*, we concatenate v_{i}, v_{(i+j)/2} and v_{j} into a single vector and use it as one input feature of this residue pair. The input features for this pair also include mutual information, the EC information calculated by CCMpred and pairwise contact potential [39, 40].
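
A minimal NumPy sketch of this conversion, assuming the 1D output v is stored as an (L, m) array, might look as follows:

```python
import numpy as np

def outer_concat(v):
    """v: (L, m) output of the 1D module. Returns an (L, L, 3m) tensor whose
    entry (i, j) is the concatenation of v_i, v_{(i+j)//2} and v_j."""
    L, m = v.shape
    pair = np.zeros((L, L, 3 * m), dtype=v.dtype)
    for i in range(L):
        for j in range(L):
            pair[i, j] = np.concatenate([v[i], v[(i + j) // 2], v[j]])
    return pair
```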

#### Loss function

We use the maximum-likelihood method to train model parameters. That is, we maximize the probability of observing the native contacts (and non-contacts) of the training proteins. Therefore, the loss function is defined as the negative log-likelihood averaged over all the residue pairs of the training proteins. Since the ratio of contacts among all residue pairs is very small, to make the training algorithm converge quickly, we assign a larger weight to the residue pairs forming a contact. The weight is chosen such that the total weight assigned to contacts is approximately 1/8 of the number of non-contacts in the training set.
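
A minimal NumPy sketch of this weighted negative log-likelihood (with illustrative names; the actual Theano implementation differs) is:

```python
import numpy as np

def weighted_nll(pred, label, w_pos):
    """pred: (L, L) predicted contact probabilities; label: (L, L) 0/1 native
    contact map; w_pos: weight assigned to each contacting residue pair."""
    eps = 1e-8                                  # numerical safety for log
    ll = label * np.log(pred + eps) + (1 - label) * np.log(1 - pred + eps)
    w = np.where(label == 1, w_pos, 1.0)        # up-weight the rare contacts
    return -np.mean(w * ll)

# Per the text, w_pos would be chosen so that the total contact weight is
# about 1/8 of the number of non-contacts:
#   w_pos = n_noncontacts / (8 * n_contacts)
```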

#### Regularization and optimization

To prevent overfitting, we employ L_{2}-norm regularization to reduce the parameter space. That is, we want to find a set of parameters with a small L_{2} norm to minimize the loss function, so the final objective function to be minimized is the sum of loss function and the L_{2} norm of the model parameters (multiplied by a regularization factor). We use a stochastic gradient descent algorithm to minimize the objective function. It takes 20-30 epochs (each epoch scans through all the training proteins exactly once) to obtain a very good solution. The whole algorithm is implemented by Theano [41] and mainly runs on a GPU card.
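
For illustration, in a modern framework this kind of objective is often implemented by folding the penalty into the optimizer; the PyTorch snippet below uses weight decay (a squared-L_{2} penalty) as a stand-in for our Theano implementation, with assumed hyper-parameter values:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # tiny stand-in for the deep residual network

# weight_decay adds the regularization term to every gradient step; the lr
# and weight_decay values here are assumptions, not the paper's settings.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```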

### Training and test data

We test our method using some public datasets, including the 150 Pfam families [5], the 105 CASP11 test proteins, 398 membrane proteins (Supplementary Table 1) and 76 recently-released hard CAMEO test proteins (Supplementary Table 2). For the CASP test proteins, we use the official domain definitions, but we do not parse a CAMEO or membrane protein into domains.

Our training set is a subset of PDB25 created in February 2015, in which any two proteins share less than 25% sequence identity. We exclude a protein from the training set if: (i) its sequence length is smaller than 26 or larger than 700; (ii) its resolution is worse than 2.5Å; (iii) it has domains made up of multiple protein chains; (iv) it has no DSSP information; or (v) its PDB, DSSP and ASTRAL sequences are inconsistent [42]. Finally, we also exclude the proteins sharing >25% sequence identity or having a BLAST E-value <0.1 with any of our test proteins. In total there are 6767 proteins in our training set, from which we have trained 7 different models. For each model, we randomly sampled ∼6000 proteins from the training set to train the model and used the remaining proteins to validate the model and determine the hyper-parameters (i.e., the regularization factor). The final model is the average of these 7 models.

### Protein features

We use protein features similar to, but fewer than, those used by MetaPSICOV. In particular, the input features include protein sequence profile (i.e., position-specific scoring matrix), predicted 3-state secondary structure and 3-state solvent accessibility, direct co-evolutionary information generated by CCMpred, mutual information and pairwise potential [39, 40]. To derive most features for a protein, we need to generate its MSA (multiple sequence alignment). For a training protein, we run PSI-BLAST (with E-value 0.001 and 3 iterations) to scan the NR (non-redundant) protein sequence database dated October 2012 for sequence homologs, and then build its MSA and sequence profile and predict other features (i.e., secondary structure and solvent accessibility).

For a test protein, we generate four different MSAs by running HHblits [43] with 3 iterations and E-value set to either 0.001 or 1, searching through the uniprot20 HMM library released in either November 2015 or February 2016. From each MSA, we derive one sequence profile and employ our in-house tool RaptorX-Property [44] to predict the secondary structure and solvent accessibility accordingly. That is, for each test protein we generate 4 sets of input features and accordingly 4 different contact predictions. We then average these 4 predictions to obtain the final contact prediction. This averaged prediction is about 1-2% more accurate than that from a single set of features (detailed data not shown). Although there are quite a few approaches, such as Evfold and PSICOV, that can generate direct evolutionary coupling information, we employ only CCMpred because it is very fast when running on a GPU card [4].

### Programs to compare and evaluation metrics

We compare our method with PSICOV [5], Evfold [6], CCMpred [4], and MetaPSICOV [7]. MetaPSICOV [7] performed the best in CASP11 [30]. All the programs are run with parameters set according to their respective papers. We evaluate the accuracy of the top *L/k* (*k*=10, 5, 2, 1) predicted contacts where L is protein sequence length [3]. The prediction accuracy is defined as the percentage of native contacts among the top *L/k* predicted contacts. We also divide contacts into three groups according to the sequence distance of two residues in a contact. That is, a contact is short-, medium- and long-range when its sequence distance falls into [6, 11], [12, 23], and ≥24, respectively.

### Calculation of Meff

Meff measures the amount of homologous information in an MSA (multiple sequence alignment). It can also be interpreted as the number of non-redundant sequences in an MSA. To calculate the Meff of an MSA, we first calculate the sequence identity between any two protein sequences in the MSA. Let a binary variable S_{ij} denote the similarity between two protein sequences i and j. S_{ij} is equal to 1 if and only if the sequence identity between i and j is at least 70%. For a protein i, we calculate the sum of S_{ij} over all the proteins (including itself) in the MSA and denote it as S_{i}. Finally, we calculate Meff as the sum of 1/S_{i} over all the protein sequences in this MSA.
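
A direct (if slow, since it compares all sequence pairs) Python sketch of this calculation, with a simplified identity measure, is:

```python
import numpy as np

def seqid(a, b):
    """Fraction of identical positions between two aligned sequences
    (a simplified identity measure; gap handling is an assumption)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def compute_meff(msa, cutoff=0.7):
    """msa: list of equal-length aligned sequences. S_i counts the sequences
    (including i itself) with >=70% identity to sequence i; Meff = sum_i 1/S_i."""
    n = len(msa)
    S = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if seqid(msa[i], msa[j]) >= cutoff:
                S[i] += 1
    return float(np.sum(1.0 / S))
```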

### 3D model construction by contact-assisted folding

We use an approach similar to that described in [9] to build the 3D models of a test protein, feeding predicted contacts and secondary structure to the Crystallography & NMR System (CNS) suite [33]. We predict secondary structure using our in-house tool RaptorX-Property [44] and then convert it to distance, angle and h-bond restraints using a script in the Confold package [9]. For each test protein, we choose the top L predicted contacts (L is sequence length), no matter whether they are short-, medium- or long-range, and convert them to distance restraints. That is, a pair of residues predicted to form a contact is assumed to have distance between 3.5Å and 8.0Å. We then generate twenty 3D structure models using CNS and select the top 5 models by the NOE score yielded by CNS [33]. The NOE score mainly reflects the degree of violation of a model against the input restraints (i.e., predicted secondary structure and contacts); the lower the NOE score, the more likely the model has high quality. When CCMpred- and MetaPSICOV-predicted contacts are used to build 3D models, we also use the secondary structure predicted by RaptorX-Property to ensure a fair comparison.
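
As an illustration, the sketch below converts the top L predicted contacts into distance restraints of the kind described; the `assign` statement format is a simplified illustration of CNS NOE-style restraints, not the exact Confold syntax:

```python
def contacts_to_restraints(contacts, L):
    """contacts: iterable of (i, j, prob). Returns distance restraints for the
    top L predictions; each line encodes center 8.0, minus 4.5, plus 0.0,
    i.e., an allowed C-beta distance range of [3.5, 8.0] angstroms."""
    top = sorted(contacts, key=lambda c: -c[2])[:L]
    lines = []
    for i, j, _ in top:
        lines.append(
            f"assign (resid {i} and name cb) (resid {j} and name cb) 8.0 4.5 0.0")
    return lines
```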

### Template-based modeling (TBM) of the test proteins

To generate template-based models (TBMs) for a test protein, we first run HHblits (with the UniProt20_2016 library) to generate an HMM file for the test protein, then run HHsearch with this HMM file to search for the best templates among the 6767 training proteins of our deep learning model, and finally run MODELLER to build a TBM from each of the top 5 templates.

## Author contributions

J.X. conceived the project, developed the algorithm and wrote the paper. S.W. did data analysis and wrote the paper. S.S. helped develop the algorithm. R.Z. helped with data analysis. Z.L. helped with algorithm development.