Deep learning-based prediction of protein structure using learned representations of multiple sequence alignments

The use of amino acid covariation and other sequence-based features as inputs to deep learning-based predictors of contacts and distances in proteins is now commonplace. The prediction process usually begins by constructing a multiple sequence alignment (MSA) containing homologues of the target protein. The most successful approaches combine large feature sets derived from MSAs, meaning that considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, making the predictor faster to run and easier to install and use. Our approach constructs a directly learned representation of all the sequences in an MSA, starting from a one-hot encoding of the sequences. The learned representation is then used as the input to a ResNet, the now-standard deep architecture for contact and distance prediction. When supplemented with a fast approximation of a precision matrix, the learned representation can be used to produce distance predictions of comparable or greater accuracy than those of our original DMPfold method. Constructing representations of complete MSAs also opens up ways of deriving other informative properties, such as predictions of likely eventual model accuracy derived solely from the MSA, as well as a complete end-to-end method for predicting α-carbon coordinates, again directly from the MSA alone. Our methods will be made available on GitHub under a permissive license, as part of an upcoming new version of DMPfold.


Introduction
Analysis of amino acid residue covariation in deep multiple sequence alignments (MSAs) has revealed that covarying residues are frequently found to be close together in the tertiary structure. This principle has been successfully exploited for predicting inter-residue contacts by a variety of methods. Notable among these are the family of methods based on direct-coupling analysis (DCA) [1][2][3] , and more recently, a number of methods based on deep learning. The latter have produced increasingly precise predictions of inter-residue contacts in recent years [4][5][6] , and have also been adapted to output probabilistic predictions of the distance between the residues in contact, either as a probability distribution [7][8][9][10] , or as real values [11][12][13] . These predictions are usually made by models that operate on precomputed features derived from an MSA, such as covariance and/or precision matrices, contact predictions from DCA-based methods, together with other features such as predicted secondary structure labels and sequence profiles. This approach, though effective, requires that these features be precomputed, which can in some cases be a time-consuming process and can sometimes take longer than the rest of the prediction pipeline combined. Additionally, it does not allow the neural network (NN) model to utilise all the information that might be available in the sequence alignment, as a model trained on derived features is limited to using the information available in those handpicked features.
A key difficulty in using MSAs directly as input is the fact that an MSA for a target sequence can have arbitrarily many sequences in it. Only one published method, rawMSA [14] , attempts to use an encoding of the MSA itself as the input to a deep neural network. In this method, the difficulty of embedding MSAs of arbitrary depth was addressed by training convolutional networks using predetermined maximum MSA depths. For deeper target MSAs, only a subset of sequences of the MSA are used, thus potentially discarding valuable additional information in the MSA. Here, we demonstrate that it is possible to use all the sequence information in the MSA when constructing a learned representation, and that such a representation can be used for effective prediction of structural features.
In this paper, we present: (a) a method for directly processing an MSA into a learned representation and its use in predicting structural features in proteins; (b) a method which predicts eventual model quality (in terms of TM-score [15] ) using this learned representation; and (c) a method for end-to-end prediction of Cα coordinates from the MSA embedding. We show that despite the relatively simple composition of the input feature set, we are able to obtain predictions of comparable and sometimes better accuracy than DMPfold [9] .

Datasets for training and evaluation
2.1.1. MSA embedding and structural feature prediction network
Training was conducted on a set of 31159 domains from the V4.2 CATH s35 representative set of domains, supplemented with 6742 full length chains from the original DMPfold1 training set. A set of 300 chains were held out from training to use as a validation set to monitor convergence. MSAs for each training example in the CATH s35 set were created using 3 iterations of HHblits v3.0b3 [16] using the UniClust30 (October 2017) database, with an E-value threshold of 0.001.
We tested the effectiveness of our method on a set of 39 domains from the CASP13 experiment, categorized as either FM or FM/TBM by the CASP13 assessors. MSAs were built using an approach similar to that taken by our group in CASP13 [5] , using the UniClust30 2020_03 database, plus protein sequences from the UniRef100 [17] , EBI MGnify [18] , IMG/M [19] , MetaEuk [20] , and NCBI Transcriptome Shotgun Assembly databases. The latter five databases were searched when the initial HHblits/UniRef30 search returned fewer than 2000 hits; otherwise the HHblits MSA was used. The extended sequence databases were searched using 2 rounds of an iterative procedure comprising: hmmsearch [21] against the databases using a query MSA and a permissive E-value threshold; clustering the hits and aligning them using kClust [22] and MAFFT [23] ; preparing a custom database for HHblits using the aligned clusters; and finally, an HHblits search against this database.

TM-score prediction network
Beginning from all the domains in CASP8-12 with available coordinates, domains with greater than 30% sequence identity covering at least 60% of the protein length were clustered together using BlastClust [24] . A representative target was taken from each cluster at random to yield 500 non-redundant training domains overall.
An MSA was generated for each target domain using HHblits searches against the UniClust30 (October 2017) database with the following parameters: E-value cutoff of 0.1, infinite number of effective sequences, maximum pairwise sequence identity of 90% and the '-diff' parameter set to inf. These parameters were intentionally kept lenient in order to increase the diversity of sequences within each MSA. To increase the number of datapoints, we subsampled each alignment by generating sub-alignments starting from the target sequence only, and progressively adding 5 sequences at random from the full MSA in a stepwise manner. In most cases, the full alignments were not utilised. In total, 94742 alignments were generated across 500 target domains.

NN Model architectures, input and output features
2.2.1. MSA embedding and structural feature prediction network
A schematic representation of the MSA embedding procedure is shown in Figure 1. We used a system of gated recurrent unit (GRU) layers to process and embed the input MSA, and output a feature map with 256 x L values, where L is the length of the target sequence. Starting from a 22-dimensional one-hot encoding of each residue in each sequence in the MSA, a stack of 2 GRU layers scans individual columns in the MSA in the vertical direction. The hidden state of these layers is a vector of fixed size (512 in our case), and the hidden state at the end of the vertical scan is used as a fixed-length representation of the information in each column of the MSA. The per-column representations are then passed to another stack of 2 bidirectional GRU layers with 256 hidden units, producing final embeddings for each column of 2 x 256 values. These embeddings are striped vertically and horizontally (similar to the manner in which per-residue features are prepared for use by 2D convolutional layers [4,5]), and combined with a fast approximation of a precision matrix, calculated using the fast_dca algorithm from trRosetta [10]. These features are then used as input to a convolutional Maxout layer, followed by a series of 16 residual blocks composed of one maxout layer [25] and one squeeze-excitation layer [26] each. A schematic of the overall neural network system is shown in Figure 2.
The outputs of the ResNet are a collection of structural features for each residue pair, which we collectively term a "multigram": (a) binned distance predictions (distograms) in 34 bins of 0.5 Å width, with the first bin representing distances <4 Å, and the last bin representing distances >20 Å; (b) hydrogen bond predictions as in DMPfold1; and (c) backbone phi and psi torsion angle predictions, each represented as values in 34 bins covering the range [0,2π]. The different structural features are predicted jointly by the same neural net. For inference, we trained four different versions of the neural net and averaged their predictions. As with DMPfold1 [9], we also trained "iterative" versions of the neural net model, with an extra input feature channel carrying pairwise Cα distances from an existing 3-D structure. This extra feature channel is added to the inputs of the ResNet section of the model.
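The distance binning described above can be sketched as follows. This is a minimal illustration rather than the DMPfold2 code: the function name `distance_to_bin` is hypothetical, and the handling of distances that fall exactly on a bin boundary is an assumption.

```python
import math

N_BINS = 34        # total number of distance bins, as stated in the text
FIRST_EDGE = 4.0   # the first bin holds distances below 4 Å
LAST_EDGE = 20.0   # the last bin holds distances beyond 20 Å
WIDTH = 0.5        # width of the interior bins in Å


def distance_to_bin(d: float) -> int:
    """Map a distance in Å to a bin index in the range 0..33.

    Assumption: a distance lying exactly on a boundary is assigned to
    the upper bin, so 20.0 Å itself lands in the last bin.
    """
    if d < FIRST_EDGE:
        return 0
    if d >= LAST_EDGE:
        return N_BINS - 1
    # 32 interior bins of 0.5 Å cover the range [4, 20)
    return 1 + int(math.floor((d - FIRST_EDGE) / WIDTH))
```

With this layout, bins 0 to 8 together cover all distances below 8 Å, which is the range summed when distograms are converted to contact predictions.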

TM-score prediction network
The input features to the TM-score prediction network are derived from a 'differential distogram'. This distogram is constructed as the difference between two distograms: one generated using a given sub-alignment, and the other generated using only the target sequence. The differential representation describes the change in probability (Δp) at each i,j,k position of the distogram as a result of the sequences that accompany the target sequence in the sub-alignment. We further divide the differential distogram according to the sequence separation between residues i and j, to consider the predictive power of Δp in the context of short-range (2<|i-j|<6), medium-range (5<|i-j|<24) and long-range (|i-j|>=24) contacts.
To generate a feature set of fixed length and which is invariant to the protein length, we bin all Δp values within each of the 34 distance bins into a one-dimensional distribution of Δp values. Distance bins 1-33 are each binned into 20 bins (from Δp of -0.1 to +0.1 with width of 0.01). Distance bin 34 has an open ended distance threshold of >20 Å. As such, the Δp distribution is wider, and is instead binned into 40 bins from -0.2 to +0.2 (bin width of 0.01). Overall, distance bins 1-33 and 34 contribute 660 and 40 features, respectively. Finally, the bin counts are flattened into a 1x700 feature array.
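The construction of the 700-element feature array can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function names are hypothetical, Δp values outside the histogram range are assumed to be clipped into the outermost bins, and raw counts are used because no normalisation is described.

```python
import math


def _hist_index(value, lo, width, n_bins):
    """Index of the histogram bin holding `value`, clipped to the valid range."""
    idx = int(math.floor((value - lo) / width))
    return min(max(idx, 0), n_bins - 1)


def delta_p_features(entries):
    """Build the flat feature array from (distance_bin, delta_p) pairs.

    `entries` yields one pair per (i, j, k) position of the differential
    distogram, with distance_bin k in 0..33.
    """
    features = [0] * (33 * 20 + 40)  # 660 + 40 = 700 features
    for k, dp in entries:
        if k < 33:
            # distance bins 1-33 in the text: 20 bins from -0.1 to +0.1
            features[k * 20 + _hist_index(dp, -0.1, 0.01, 20)] += 1
        else:
            # open-ended distance bin 34: 40 bins from -0.2 to +0.2
            features[660 + _hist_index(dp, -0.2, 0.01, 40)] += 1
    return features
```

Because the array length depends only on the bin layout, alignments for proteins of any length map to the same number of features.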
Using the above approach, we assessed a total of ten feature sets, consisting of either the binned Δp values derived from short, medium or long-range contacts directly (700 features), or summed or concatenated combinations (700, 1400 or 2100 features). Thus our feature sets can also be grouped according to the number of features used (700, 1400 or 2100).
The TM-score prediction network is a fully connected neural network consisting of 9 hidden layers, each with 200 neurons. The rectified linear unit (ReLU) activation function is applied to each hidden layer.
Given an alignment, the aim of the TM-score prediction network is to predict the potential TM-score associated with a generated model when using the alignment as an input to a model generation pipeline. Since multiple model conformations can arise from the same alignment, we predict the potential TM-score as a distribution rather than as a single value. To do this, we quantize the TM-score into an ordered set of 100 labels as $\{0.01, 0.02, \ldots, 1.00\}$. The network produces 100 log softmax outputs corresponding to the TM-score distribution $y \in \mathbb{R}^{100}$, where $y_i$ is the probability that input $X$ has a TM-score of $i/100$. The final TM-score is returned as the expected value of the probability distribution.
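The final step, turning the 100 log softmax outputs into a single predicted TM-score, amounts to an expectation over the quantized labels. The following is a minimal sketch; the function name `expected_tm_score` is hypothetical, and the renormalisation is a safeguard that the text does not describe.

```python
import math


def expected_tm_score(log_probs):
    """Expected TM-score from 100 log-probabilities over labels 0.01..1.00."""
    probs = [math.exp(lp) for lp in log_probs]
    total = sum(probs)  # approximately 1 for true log softmax outputs
    # label i (1-based) corresponds to a TM-score of i/100
    return sum((i + 1) / 100 * p for i, p in enumerate(probs)) / total
```

A uniform distribution over the 100 labels gives an expected TM-score of 0.505, while a distribution concentrated on a single label returns that label's value.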

Protein model generation
a) Distance geometry and simulated annealing: DMPfold1 predictions were made using the CNS-based procedure as previously described [9]. The structural features predicted by the DMPfold2 NN model were used as constraints for the distance geometry and simulated annealing procedure in CNS [27], which is identical to that used in DMPfold1. Procedures for converting the distance and torsion angle predictions into upper and lower bounds were optimised in a similar manner as in DMPfold1. After the first round of predictions, one additional round of the iterative distance prediction and model generation step was run to get the final model for both DMPfold1 and 2. 50 candidate models were generated at each step. No all-atom refinement was performed on the models.
b) End-to-end prediction of coordinates from the MSA: To evaluate the ability of the DMPfold2 model to be trained and used in a purely end-to-end fashion, a modified network architecture was used with the normal multigram convolutional output layer replaced by a single channel convolutional layer. The outputs from this layer are converted to a real-value distance matrix $D$ by first averaging across the diagonal to ensure a symmetric matrix and then taking the absolute values. This matrix is projected to 3-D Cα coordinates using multi-dimensional scaling [28]. A Gram matrix $M$ is defined from $D$, and its eigendecomposition $M = U S U^T$ is then calculated, where $X = U\sqrt{S}$ gives the coordinates of the points. Note that if the atoms can be fully embedded in 3-D space there will only be 3 non-zero eigenvalues of $M$, but this is not necessarily the case for the real-value distance matrix predicted by the network. Coordinates corresponding to the largest 3 eigenvalues were therefore calculated and the rest discarded. Once the 3-D coordinates are obtained, the distance map computed from these coordinates is embeddable in 3-D space.
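The projection from a predicted distance matrix to coordinates can be illustrated with classical multidimensional scaling in NumPy. This is a sketch under stated assumptions, not the DMPfold2 implementation: the Gram matrix is built here by double-centring the squared distances, which is one standard choice, and the function name `mds_coordinates` is hypothetical.

```python
import numpy as np


def mds_coordinates(raw, n_dims=3):
    """Project a square matrix of predicted distances to n_dims coordinates."""
    # average across the diagonal to symmetrise, then take absolute values
    d = np.abs((raw + raw.T) / 2.0)
    n = d.shape[0]
    # assumed Gram matrix: double-centred squared distances
    centring = np.eye(n) - np.ones((n, n)) / n
    m = -0.5 * centring @ (d ** 2) @ centring
    # eigendecomposition M = U S U^T (eigh returns ascending eigenvalues)
    evals, evecs = np.linalg.eigh(m)
    order = np.argsort(evals)[::-1][:n_dims]
    # keep the largest eigenvalues; clip small negatives from inexact inputs
    s = np.clip(evals[order], 0.0, None)
    # X = U sqrt(S)
    return evecs[:, order] * np.sqrt(s)
```

When the input distances come from points that really lie in 3-D space, the recovered coordinates reproduce those distances up to rotation, translation and reflection.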
As distances are invariant under mirror symmetry, the multidimensional scaling can produce a mirror of the correct structure arbitrarily. This can be manually detected by evaluating the overall distribution of torsion angles, and a mirror transformation applied to the coordinates if required. In our end-to-end model, however, we were able to avoid these steps by simply feeding the output coordinates through a final 2-layer bidirectional GRU recurrent network (256 weights per hidden layer), with 3 input channels and a final fully connected layer with 512 inputs and 3 outputs. During training with coordinate RMSD loss, this final network is able to learn to automatically transform the output coordinates for any structures mirrored by the multidimensional scaling process, without manual intervention.
The current version of the DMPfold2 end-to-end model predicts Cα coordinates only; however, the missing main-chain and Cβ atom positions can be reconstructed very quickly using either PULCHRA [29] or the 'catomain' routine from the DRAGON method [30]. No further optimisation or refinement of the structure was attempted in any of the results shown.

MSA embedding and structural feature prediction network
The DMPfold2 network was implemented using PyTorch [31] and trained using the Adam optimizer [32] and focal loss [33] , with parameter α set to 0.5. We found that focal loss is better able than cross-entropy loss to deal with the imbalance in the ground truth labels (distance distributions at different sequence separation ranges) and can also focus learning on the harder-to-predict long-range distances by automatically reducing the weight on easy-to-predict features of the multigram as training proceeds. An initial learning rate of 0.0003 was used, a dropout probability of 0.1 was used for the recurrent network layers, and dropout 0.2 used in the convolutional layers. For data augmentation, at each epoch each alignment in the training set is subsampled as follows: a) Random rows from the alignment are selected, up to a maximum of 1000, though always including the target sequence in the first row. b) Columns with gaps in just one randomly selected alignment row are then deleted. This simulates an evolutionary deletion process in the target sequence, producing biologically relevant random crops of the target sequences i.e. crops that are consistent with previously observed deletions in the alignment. These deleted columns are also removed from the target tensors so that the target distograms, for example, have the same number of rows and columns as the length-reduced target sequence.
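The two augmentation steps can be sketched in plain Python as follows. This reflects one reading of the description and is not the training code: the function name `subsample_msa` is hypothetical, the number of rows kept is assumed to be as large as the limit allows, and the row used to choose deleted columns is assumed to be drawn from the selected rows.

```python
import random


def subsample_msa(msa, max_rows=1000, rng=random):
    """Subsample rows, then delete columns gapped in one random row.

    msa: list of equal-length strings with the target sequence first;
    '-' denotes a gap. Returns the reduced alignment and the indices of
    the columns that were kept, for cropping the target tensors.
    """
    # a) random rows, always keeping the target sequence in the first row
    others = list(range(1, len(msa)))
    rng.shuffle(others)
    rows = [0] + sorted(others[: max_rows - 1])
    # b) delete the columns that are gaps in one randomly selected row
    picked = rng.choice(rows)
    keep = [c for c, ch in enumerate(msa[picked]) if ch != "-"]
    sub = ["".join(msa[r][c] for c in keep) for r in rows]
    return sub, keep
```

The returned column indices would then be used to remove the same rows and columns from the target distograms, so that inputs and targets stay the same length.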
Four separate networks were trained from different random weight initializations, so that outputs from this ensemble of networks could be averaged during inference.
For training the end-to-end model of DMPfold2, the same training procedure was followed, though with RMSD coordinate error as the loss function, and with a fine-tuning procedure for weights in common with the standard multigram network i.e. those weights were initialized with the values taken from one of the original fully-trained DMPfold2 networks. In this case, the RMSD between the output Cα coordinates and the experimental structure Cα coordinates is calculated using a fast quaternion-based method [34] , and this was used as the loss function for backpropagation rather than focal loss. In order to avoid numerical instability during training, the final square root in the RMSD calculation is omitted. No model ensembling was used in this case as it would be highly suboptimal to produce a simple average of predicted real coordinates, though we may investigate other ensembling strategies in the future.
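A loss of this form, the RMSD after optimal superposition with the final square root omitted, can be illustrated in NumPy. Note that this sketch swaps in the SVD-based Kabsch formulation in place of the quaternion-based method cited above, and the function name `squared_rmsd` is hypothetical; for training, the same arithmetic would need to be expressed in a framework that supports automatic differentiation.

```python
import numpy as np


def squared_rmsd(pred, ref):
    """Mean squared deviation between two (N, 3) coordinate sets after
    optimal translation and proper rotation (no reflection allowed)."""
    p = pred - pred.mean(axis=0)
    q = ref - ref.mean(axis=0)
    # singular values of the 3x3 covariance matrix
    u, s, vt = np.linalg.svd(p.T @ q)
    # a negative determinant means the best orthogonal fit is a reflection;
    # flipping the smallest singular value restricts it to proper rotations
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    msd = (np.sum(p * p) + np.sum(q * q)
           - 2.0 * (s[0] + s[1] + d * s[2])) / len(p)
    return max(msd, 0.0)  # guard against tiny negative rounding errors
```

Because reflections are excluded, a mirrored structure gives a non-zero loss, which is what allows a network trained on such a loss to learn to correct mirrored outputs.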

TM-score prediction network
The TM-score prediction network was trained using the Adam optimizer with an initial learning rate of 0.001. Losses are calculated using Kullback-Leibler (KL) divergence and are back-propagated with respect to a known TM-score distribution that is derived from a sample of 50 models generated using DMPfold2 with the CNS-based modelling procedure (Section 2.3a) from each alignment.
To assess the performance of the ten network variants, we trained each using 450 domains from CASP8-12 (85033 alignments) and used 35 domains from CASP13 FM targets for testing (2586 alignments). 50 domains from CASP8-12 (9709 alignments) were selected at random and used for validation to prevent overfitting. The same set of validation targets were maintained when training each network variant. Each network was trained for up to 200 epochs and early stopping with a patience of 50 steps was employed to monitor the validation loss to avoid overfitting.

Performance evaluation procedures
2.5.1. Distances, contacts and tertiary structure predictions
To evaluate distance predictions (distograms), we used a simple measure of the quality of the distograms. For a given residue pair, we calculated the maximum likelihood distance from the distograms by removing the last bin (corresponding to a distance of >20 Å), re-normalising the remainder of the bin probabilities to sum to 1, and summing the bin centre distances multiplied by the re-weighted probabilities. The absolute difference between this maximum likelihood distance and the native distance was calculated, and the mean taken over all residue pairs separated by at least 5 residues in sequence and closer than 20 Å in the native structure.
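The distance measure described above can be sketched as follows. The bin centres are assumptions: interior bins are taken at their midpoints, and the open-ended first bin is given a nominal centre of 3.75 Å. The function names are hypothetical.

```python
# Assumed centres for the 33 bins that remain after dropping the last bin
CENTRES = [3.75] + [4.25 + 0.5 * k for k in range(32)]


def expected_distance(probs, centres=CENTRES):
    """Probability-weighted mean distance for one residue pair.

    probs: the 34 bin probabilities. The last bin (>20 Å) is removed and
    the rest are re-normalised to sum to 1. Raises ZeroDivisionError if
    all of the probability mass sits in the last bin.
    """
    kept = probs[:-1]
    total = sum(kept)
    return sum(p * c for p, c in zip(kept, centres)) / total


def distogram_mae(distogram, native, min_sep=5, max_dist=20.0):
    """Mean absolute error over pairs at least min_sep apart in sequence
    and closer than max_dist Å in the native structure."""
    errors = []
    n = len(native)
    for i in range(n):
        for j in range(i + min_sep, n):
            if native[i][j] < max_dist:
                predicted = expected_distance(distogram[i][j])
                errors.append(abs(predicted - native[i][j]))
    return sum(errors) / len(errors)
```

As written, the quantity computed per pair is a probability-weighted mean over bin centres, following the procedure given in the text.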
Distance predictions (distograms) were also converted to binary contact predictions by summing up the probabilities in the distogram up to the bin ending at 8 Å. Contact predictions were assessed by ranking the contacts in descending order of scores, and evaluating the precision of the top-L/x contacts, where L is the length of the target sequence, and x is one of 1, 2, 5, or 10.
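The conversion to contact scores and the top-L/x precision can be sketched as below. The bin count assumes the layout of one bin below 4 Å followed by 0.5 Å bins, so that the first nine bins end at 8 Å; the function names and the dictionary-based inputs are hypothetical, and the default separation of 24 corresponds to the long-range definition used in Table 1.

```python
def contact_probability(probs, n_contact_bins=9):
    """Sum the distogram probabilities for all bins ending at or below 8 Å."""
    return sum(probs[:n_contact_bins])


def top_precision(scores, is_contact, seq_len, x, min_sep=24):
    """Precision of the top-L/x scoring pairs.

    scores, is_contact: dicts keyed by residue pairs (i, j) with i < j.
    Assumes at least one pair meets the separation threshold.
    """
    pairs = [p for p in scores if p[1] - p[0] >= min_sep]
    # rank pairs in descending order of predicted contact score
    pairs.sort(key=lambda p: scores[p], reverse=True)
    top = pairs[: max(1, seq_len // x)]
    return sum(1 for p in top if is_contact[p]) / len(top)
```

For example, with a target of length 100 and x = 5, the 20 highest-scoring long-range pairs are assessed.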

TM-score predictions
TM-score prediction accuracy was assessed as the mean absolute error (MAE) with respect to the true TM-score. The quoted MAE of each network variant was calculated as the average MAE of each target set of data points. We additionally calculated the average potential loss in TM-score for each network variant. This metric is calculated by taking the absolute difference between the actual TM-scores of the predicted best MSA and the true best MSA. One TM-score loss value is calculated per target, and the quoted value is the average over all targets.
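Both metrics can be computed per target from two parallel lists, one entry per sub-alignment. The following is a minimal sketch with hypothetical function names; the quoted values would then be averages of these per-target results.

```python
def mean_absolute_error(predicted, actual):
    """MAE between predicted and actual TM-scores for one target."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def potential_tm_loss(predicted, actual):
    """Actual TM-score given up by choosing the alignment ranked first.

    Compares the actual TM-score of the alignment with the highest
    predicted TM-score against the best actual TM-score for the target.
    """
    best_predicted = max(range(len(predicted)), key=predicted.__getitem__)
    return abs(max(actual) - actual[best_predicted])
```

A loss of zero means that the network ranked the truly best alignment first, even if its predicted TM-scores were not numerically accurate.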

Accuracy of distance predictions
We evaluated the distance predictions from the network by first converting them into binary contacts using the standard 8 Å threshold. Table 1 shows the long-range precision of these contact predictions, comparing the outputs from the DMPfold1 NNs and those from the NNs using MSA embedding, which we will denote as DMPfold2 for convenience. We compared both versions of the DMPfold distance prediction NNs against rawMSA, the only other published method that uses a directly learned representation of the input multiple sequence alignment. The comparison shows that the new method employing a learned representation of the MSA is comparable to the DMPfold1 NNs in terms of contact precision on the 39 CASP13 domains, and on some measures is slightly better. Both methods are substantially more precise than rawMSA. These results are encouraging given that we have replaced a large set of the input features used in DMPfold1 with a learned representation of the MSA. Although the best performance is obtained when supplementing the learned representation with a precision matrix, these observations suggest that accurate prediction of structural features with minimal pre-processing of the MSA is possible.
Although rawMSA does not predict residue-residue distances, we can compare distance predictions for DMPfold1 and DMPfold2. The MAE between the maximum likelihood predicted distance and the native distance was calculated for all native residue pairs closer than 20 Å. This value is 2.01 Å for DMPfold2 and 2.23 Å for DMPfold1, indicating an improvement in performance.

Tertiary structure model accuracy
As an additional evaluation of the effectiveness of the MSA embedding, we built 3-D models of 39 FM and FM/TBM targets from the CASP13 experiment, using the DMPfold2 NNs and the same CNS-based model building procedure used in DMPfold1, as well as the end-to-end model generation procedure.
Comparison against models generated using the DMPfold1 NNs (Figure 3 and Table 2) shows that the DMPfold2 NNs produce predictions that, on average, enable more accurate structure modelling for these domains. DMPfold2 is able to fold 30 domains to a TM-score of 0.5 or greater, as compared to 26 for DMPfold1. The mean TM-score is 0.557 for DMPfold2 and 0.531 for DMPfold1. Additional benefits may be realised by using different model building procedures, and we are currently experimenting with a variety of model building protocols with a view to improving model quality.
Figure 4 compares the TM-scores of models generated by DMPfold2, using either the standard CNS-based structure reconstruction approach, or coordinates predicted by the end-to-end version of the neural networks. We found that using either PULCHRA or catomain to build the missing main-chain and Cβ positions gave very similar results, with catomain producing a slightly higher mean TM-score. Although the models generated by the end-to-end version are not as accurate as the CNS models on average (mean TM-score of 0.484 compared to 0.557 for the CNS version), they are still often of acceptable quality, especially when considering the relatively straightforward method used to reconstruct atom positions.
Interestingly, there was one target (T0986s2-D1) for which the end-to-end prediction produced the correct fold where CNS could not. The end-to-end prediction for this domain had a TM-score of 0.52 compared to 0.23 for the CNS version.

Evaluation of model quality estimates predicted directly from MSAs
We developed a network that predicts the eventual model TM-score directly from an MSA. Ten variants of this network were trained, using features that described short, medium and long-range interactions, or combinations thereof (Table 3 and Figure 5). These networks each take an MSA as input and output a predicted TM-score distribution. The performance of each network is assessed as the average MAE between the predicted and actual TM-score. The latter is evaluated as the mean TM-score of 50 models that were produced with DMPfold2 when given the MSA as an input. We assessed the performance of each network by testing on the CASP13 target domains.
Comparison of the ten network variants reveals that the lowest average MAE of 0.062 is achieved using the sum(short, med) & long features. This feature set consists of short/medium-range contacts (2<|i-j|<24) and long-range contacts (|i-j|>=24). Similarly low MAE values were also obtained from other feature sets. The MAE values of the ten networks cluster into two groups: those that achieve an MAE above 0.09 and those that achieve an MAE below it. This clustering is paralleled by the inclusion of long-range features, which results in a lower MAE. However, while it appears that short and medium-range contact-based features are less informative on their own, their inclusion in the network complements the predictive power of the long-range features.
A key benefit of being able to predict the eventual TM-score is that MSAs can be ranked according to their predicted modelling power, prior to the time-consuming step of model generation. To assess the ranking power of the networks, we calculated the potential loss in TM-score for each set of target alignments for each network variant (Table 3). This metric answers the following question: what is the potential loss in the actual TM-score if alignments are ranked according to the TM-score predictions? The lowest average potential TM-score loss was 0.0338 using the short & sum(med, long) features. However, similarly low TM-score loss can be observed across the other networks tested, even for networks which do not achieve a low MAE. Taken together, this indicates that meaningful MSA ranking does not necessarily rely on accurate TM-score predictions.

Conclusions and future directions
Recent successes in applying deep learning to protein structure prediction have mostly relied on the use of large, precomputed feature sets as inputs to the deep learning model. Here, we show that it is possible to directly process a multiple sequence alignment into a learned representation that is more effective for predicting structural features in proteins. Using this approach, we are developing the next generation of our DMPfold method for deep learning-based structure prediction. The distance predictions from the DMPfold2 NNs are of comparable or higher precision than those obtained from DMPfold1 NNs, and the structural models built from the DMPfold2 restraint sets are also more accurate on average than those from DMPfold1. It is expected that allowing the network to access all the information in a raw MSA, rather than just pairwise frequency information, for example, enables it to extract richer information that can be used for more accurate prediction of structural features. Although the best performance is obtained when using the learned MSA representation alongside an on-the-fly computed precision matrix, the work in this study opens up the possibility of using just the MSA representation itself as the sole input to the ResNet for predicting protein structure, though that will probably require using methods with better ability to deal with very long range dependencies (see below). It also enables new lines of work that were prohibitive with large feature sets (such as those used in DMPfold1), due to the time and storage requirements of using those features.
As an example, we developed a predictor of model TM-score that uses different versions of predicted distograms as input, each of which was generated using different versions of an input MSA. Training a predictor in this fashion would have been extremely time-consuming with a more traditional large feature set. The ability to predict TM-score from an MSA, together with the ability to easily work with different versions of an MSA, opens up the possibility of optimizing the set of sequences in an MSA so as to produce the most accurate structure predictions. We trialled the use of such a procedure in the CASP14 experiment.
The ability of DMPfold2 to be easily trained and used in a strictly end-to-end manner also opens up new possibilities. The idea of end-to-end de novo prediction [35,36] has certainly been tantalizing, in that a 3-D model can be produced in a fraction of a second compared to the hours or days needed previously. So far, however, published end-to-end methods have not been able to produce models comparable to the state-of-the-art in de novo prediction, mainly because they have not effectively exploited covariation data as inputs. Indeed, the very idea of having to spend time pre-calculating covariation input sequence features would obviously devalue the whole concept of end-to-end modelling. By combining direct MSA embedding with the idea of end-to-end coordinate generation by a learned multidimensional scaling process in the neural network, DMPfold2 is able to produce Cα coordinates comparable in accuracy to the full CNS modelling approach in just a few hundred milliseconds per target. This will allow quick validation of de novo predictions before full modelling is carried out, or could be combined with a refinement method to almost completely replace the whole 3-D modelling pipeline. Another usage example would be to visualize changes to the final 3-D model as the input sequences are changed, virtually in real time.
The use of two gated recurrent networks to embed the entire MSA is clearly effective, but was something of a design compromise: RNNs are relatively fast to train, but have known limitations in modelling long-range dependencies in sequences. In theory, gated RNNs are able to avoid the problems of vanishing gradients when modelling long sequences, but in practice, dependencies beyond a window of a few hundred time steps are poorly modelled. For this reason we used two GRU networks, one to embed in the vertical (sequence number) direction and one in the horizontal (residue number) direction, so that the number of time steps that each GRU would need to model would be limited by either the lengths of typical protein domains or the depths of typical MSAs. For MSAs with longer sequences or with many more homologous sequences, even gated RNNs will start to become ineffective. We are currently investigating alternative means of embedding MSAs, such as the use of models based on new efficient transformer architectures [37], which are far more memory-efficient than the original Transformer [38] because they avoid the calculation of large self-attention matrices over the length of the sequences. Standard transformer models have already been used to embed unaligned protein sequences [39][40][41], but the most efficient transformer models released in the last year are now capable of handling sequence lengths even in the millions. This suggests that a single deep transformer model with a compressed self-attention mechanism could, in principle, embed a whole MSA in one go by treating it as a single sequence. Current experiments along these lines are promising.
Overall, we have demonstrated that the idea of embedding whole MSAs into a linear representation using standard language modelling approaches can produce excellent results, both in terms of residue contact and distance prediction accuracy, and the final generation of 3-D structures directly from sequence. By being able to directly link individual amino acids in an MSA to the outputs of the network, many structural bioinformatics applications are made easier e.g. modelling variant effects or protein design. At the very least, these direct MSA embedding methods make de novo protein structure prediction methods far more efficient and easier to use.

Figure 1. Details of the MSA embedding procedure. The MSA is represented by a one-hot encoding of 22 residue types (including gaps and unknown residues). First, the residues in a single column of the MSA are treated as timesteps and fed as input to a stack of two Gated Recurrent Unit (GRU) network layers. The final hidden state of the second GRU, obtained after processing the whole column of the MSA, is used as an embedding of the information in that column. The process is then repeated for the remaining columns in the MSA, producing a separate embedding for each MSA column. Finally, these per-column embeddings are used as inputs to a stack of 2 bidirectional GRU layers that produces an embedding of all sequences and columns in the MSA. The dimensions of the input tensor and embeddings are shown in parentheses.
Figure 2. … Figure 1. This embedding is combined with a precision matrix calculated from the MSA, and fed to a convolutional Maxout layer to reduce its dimensionality, before being fed to a series of 16 residual neural net (ResNet) blocks. Each block is composed of a convolutional Maxout layer and a Squeeze-excitation layer. The outputs from the network are collected as the outputs of a 2D convolutional layer with 1x1 filter, and are a combination of different structural features, represented in the right-hand column. All outputs from the network are predicted jointly.
Figure 3. … Table 2. A dashed line of unit slope is drawn, as well as segments demarcating the TM-score >= 0.5 regions of the plot. Overall, DMPfold2 achieves comparable or greater model accuracy than DMPfold1, using a similar model generation strategy.
Figure 4. … Figure 3. For the end-to-end model, data are shown using the 'catomain' procedure for rebuilding all main-chain and Cβ positions. Similar results are obtained when using PULCHRA to rebuild these atom positions. One target, T0986s2-D1, could be folded correctly (TM-score>0.5) by the end-to-end procedure but not by CNS.
Figure 5. … (ii) Comparison of actual TM-scores of predicted best alignment and actual best alignment. Data points in red (9 targets) indicate that the actual best alignment has been correctly predicted by the network when ranked by predicted TM-score. (c) Examples of actual (blue) vs predicted (orange) TM-score distributions. Blue and orange dotted lines represent the mean TM-score of 50 models generated using the alignment, and the expected value of the predicted TM-score distribution respectively. ΔTM and KL labels show the numerical difference in TM-score between the two dotted lines, and the KL-divergence between the two distributions, respectively. The data shown is from the network variant trained using sum(short, med) & long features.

Tables
Table 1. Comparison of long-range (|i-j| >= 24) contact precision on the 39 FM and FM/TBM domains from CASP13. Predictions for rawMSA were computed using MSAs built using only HHblits searches against the UniRef30 database, as we found that predictions made using these MSAs were more precise than those obtained using the deeper MSAs built using the metagenomic sequences. DMPfold 1 and 2 used the deeper MSAs.