Protein Structural Alignments From Sequence

Computing sequence similarity is a fundamental task in biology, with alignment forming the basis for the annotation of genes and genomes and providing the core data structures for evolutionary analysis. Standard approaches are a mainstay of modern molecular biology and rely on variations of edit distance to obtain explicit alignments between pairs of biological sequences. However, sequence alignment algorithms struggle with remote homology tasks and cannot identify similarities between many pairs of proteins with similar structures and likely homology. Recent work suggests that machine learning language models can improve remote homology detection. To this end, we introduce DeepBLAST, which obtains explicit alignments from residue embeddings learned from a protein language model integrated into an end-to-end differentiable alignment framework. This approach can be accelerated on GPU architectures and outperforms conventional sequence alignment techniques in terms of both speed and accuracy when identifying structurally similar proteins.


Introduction
Proteins with unknown function can be annotated based on their similarity to proteins of known function. Protein similarity can be measured based on sequence or structure. A core tenet of molecular biology is that a protein's sequence determines its structure, which in turn determines the protein's biological function. If the structures are known, structural alignment is the preferred approach to align the protein residues and measure protein similarity [1,2,3,4,5]. Unfortunately, most sequences lack a known structure. Up to now there have been on the order of $10^8$ protein sequences discovered, but only $10^5$ proteins have annotated functions [6] and only about $10^4$ of those proteins have experimentally verified structures [7].
While structural similarity would also seem to be the most direct connection to function, sequence similarity has proven to be more pragmatic. In the case where the evolutionary distance is short, sequence similarity provides a more reliable indicator of function than the hard task of interpreting molecular structure. Thus sequence alignments are the dominant source of protein function inference.
Conceptually, similarity is assessed by evolutionary distance through estimating the number of mutations required to transform one protein sequence to another. Algorithms such as BLAST [8], HMMER [9] and Needleman-Wunsch [10] are the state-of-the-art methods for computing sequence alignments. However, proteins with similar structures do not necessarily have similar sequences [2], resulting in the under-performance of traditional sequence alignments in remote homology tasks [11,12].
Unfortunately, over half of all proteins do not have detectable homologs in standard sequence databases due to their distant evolutionary relationships [13]. Detecting these remote homologs would help us better understand mutagenesis [14], aid protein design [15], predict protein function [16], predict protein structure [17,18,19] and model evolution [20]. Moreover, these missing homologs could be critical in annotating genomes of organisms that have not been studied before. For highly divergent genomes, structural comparison would potentially identify many functionally related proteins, but this is unattainable due to the lack of structures. Thus rather than directly compare sequence to sequence, an attractive strategy is to infer structural features from each sequence and then compare those structural features. However, computing protein structural alignments given only the protein sequence is still an outstanding challenge [21].
Recent work shows that large-scale direct structural alignment from sequence may be computationally tractable. For sufficiently small proteins, ab initio modeling can predict protein structure from sequence. Consequently, ab initio structure predictions have been laboriously generated for genes with no recognizable sequence similarity and structurally aligned across multiple genomes; in effect, pure sequence-to-structure recognition. However, these were computationally expensive calculations: on the order of ten thousand CPU hours per gene [22,23].
Desirably then, a method that directly inferred structural properties and structurally informed alignments would not only enable remote homology detection for function prediction but also accelerate those laborious structure predictions [18,19]. This paper advances an approach to the grand challenge [21] of directly producing the structural alignment of two proteins from just their sequence.
Naturally, given enough training data, one can imagine many possible supervised learning approaches to this task. But as is so often the case, structurally labeled alignments are scarce due to the limited number of experimentally validated protein structures. Therefore the central objective of this paper is to validate our hypothesis that a particular unsupervised pre-training step will successfully extract structural features into a latent space. This is desirable, since protein sequences are abundantly available for unsupervised estimation of this latent space. Our hypothesis requires that this latent space embeds more structural information than the sequence itself carries. If true, one can subsequently use this embedding to enhance a supervised process over the limited set of structurally labeled data.
We validate this hypothesis by showing that the supervised training produces better structural alignments when using our unsupervised embedding in comparison to a naive sequence-only embedding. Our purpose here is not a contest against other possible methods but to determine that a language-model-derived, unsupervised embedding carries structural features. Once that is established, applications more sophisticated than our hypothesis validation harness become possible, and data-poor applications are made easier due to the abundance of data for training these language models.
To this end, we develop an end-to-end differentiable neural network that takes sequences as input and outputs a predicted structural alignment. This exploits transfer learning from a pre-trained protein language model that can represent a protein sequence as a set of residue embeddings. These residue-level embeddings are sequence position specific and can capture relevant structural information based on the contextual residues. The final step uses dynamic programming to estimate the expected alignment between two protein sequences, which is then fine-tuned against known structural alignments. Only recently has an efficient backpropagation through dynamic programming become practical [24]. This allows us to put together this end-to-end model for direct prediction of alignments from sequence inputs and train it on limited labeled datasets.
We show this type of model can be trained and validated on structural alignments. We further evaluate its precision and recall on a manually curated gold standard set of structures containing remote homologies that standard sequence alignment techniques such as BLAST and HMMER could not detect. As expected, the predicted alignments cannot perform as well as directly aligning known structures, but they outperform state-of-the-art sequence alignments by a sizeable margin.

Related Work
The most common approach to estimating sequence similarity is through computing a weighted edit distance, with weights derived to approximate the likelihood of a given mutation. Popular sequence alignment approaches often involve variants of Needleman-Wunsch [10] or Hidden Markov Models (HMM) that enable a probabilistic treatment of sequence alignment and alignment detection [25]. To limit the scope of this paper we do not discuss the many relevant works on multiple alignment [26,27]. On the other hand, tools such as TM-align [2], Dali [3], Fast [28], and Mammoth [4], have been designed to perform protein structural alignment when protein structures are available.
Recent studies have shown the benefits of using high-capacity self-supervised protein language models to predict protein structure [11,29,30,31,32,33,34]. Pre-training these protein models on large unlabeled datasets can reduce the amount of labeled data required to train classifiers, while improving generalizability.

Methods
Here we build on the notion that sequence alignment algorithms should be designed to capture structural similarity for protein sequences with corresponding well defined ensemble average protein structures. As shown in previous work [2], reliable alignments for many distant sequence pairs are not possible with conventional sequence alignment algorithms. Ideally, we would like to align structures instead of sequences to determine which residues are structurally analogous, and we would like to detect cases where the sequence similarity is low and the structure similarity is high. However, many structure alignment methods aim to maximize atomic overlap between structures, ignoring potentially relevant sequence information. Here, we aim to fuse the best of both worlds, leveraging information from the protein sequence to infer structurally relevant alignments. To be able to train our model to perform optimally at the alignment end goal defined in this manner, we create a workflow that is end-to-end differentiable, with a loss that is informed by the structural alignment. A major technical hurdle here is the need to perform dynamic programming with an unknown position-specific alignment scoring matrix. This paper exploits a recent innovation [24] to build a method that enables differentiating through the dynamic programming step, allowing training of all salient parameters and alignment scoring matrices.
Since we have only sequence, we need to capture the structural propensities of a given sequence. Many potential choices for embeddings apropos to structurally informed tasks are computationally expensive [11,29,30,31,32,33,34]. For this paper we have selected a well vetted language model [11] to cast the protein residues to a latent space with the intention of recovering the underlying grammars behind protein sequences.
In order to develop a model that can perform structural alignments, we propose using protein structural alignments estimated using a widely adopted structure alignment tool, TM-align [2], for training. Since most protein structural alignments can be represented as linear sequence alignments [4], we argue that differentiable dynamic programming coupled with language models can enable the estimation of more complex structural alignments. The high-level workflow behind our proposed DeepBLAST algorithm is given in Figure 1. Proteins X and Y are fed into the pretrained LSTM protein language model [11] to obtain embeddings $H_X$ and $H_Y$. These residue-level embeddings are then propagated through the match embeddings (M) and gap embeddings (G) in order to obtain the match scores µ and the gap scores g as discussed in Section 2.1. The match and gap scores are used to evaluate the differentiable dynamic programming algorithm and generate a predicted alignment traceback as discussed in Section 2.2. These alignments can then be fine-tuned using a training dataset of ground truth alignments as discussed in Section 2.3 and Section 2.4.

Protein Language Modeling
In order to obtain an alignment from dynamic programming, the scoring parameters for matches and gaps must be obtained. We propose to utilize pretrained protein language models to estimate these scoring parameters. These pretrained models ultimately construct a function mapping a sequence of residues, represented as one-hot encodings, to a set of residue vectors, providing an alternative representation of these proteins. Often these models learn these representations by being trained to predict randomly masked residues within a protein sequence. Multiple studies have shown the merits of these models when performing protein structure prediction, remote homology detection and protein design [30,35,29,32,31,33,34]. Here, we have used the pretrained LSTM PFam model from [11]. Using this pretrained language model, two proteins X and Y can be represented by embeddings $H_X \in \mathbb{R}^{p \times d}$ and $H_Y \in \mathbb{R}^{q \times d}$, where p and q represent the lengths of proteins X and Y and d is the embedding dimension of the language model. Given these representations, we can construct mappings M and G to obtain match scores and gap scores for the differentiable dynamic programming as follows:

$$\mu = \sigma_\mu\left(M(H_X)\, M(H_Y)^T\right), \qquad g = \sigma_g\left(G(H_X)\, G(H_Y)^T\right)$$

The functions $M : \mathbb{R}^{t \times d} \to \mathbb{R}^{t \times d}$ and $G : \mathbb{R}^{t \times d} \to \mathbb{R}^{t \times d}$ are intermediate functions that take as input a set of t residue vectors. These functions are parameterized by LSTM networks, which can be fine-tuned through the backpropagation enabled by the differentiable dynamic programming. The activation functions $\sigma_\mu$ and $\sigma_g$ are the softplus and logsigmoid functions, which ensure that the match scores µ are strictly positive and the gap scores g are strictly negative. These constraints are used to penalize gaps and reward matches. This also helps enforce identifiability of the model, which we have found to improve the accuracy of the model in practice.
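The scoring step can be illustrated with a small standard-library Python sketch. It assumes that the score for residue pair (i, j) is the inner product of the transformed residue vectors passed through the activation; the toy matrices `MX`, `MY`, `GX`, `GY` are hypothetical stand-ins for the LSTM outputs M(H_X), M(H_Y), G(H_X), G(H_Y), and the helper names are illustrative only.

```python
import math

def softplus(x):
    # log(1 + exp(x)), written in a numerically stable form; positive for finite x
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def logsigmoid(x):
    # log(sigmoid(x)) = -softplus(-x); negative for finite x
    return -softplus(-x)

def score_matrix(A, B, activation):
    # A is p x d, B is q x d; returns the p x q matrix activation(<A_i, B_j>)
    return [[activation(sum(a * b for a, b in zip(row_a, row_b)))
             for row_b in B] for row_a in A]

# Hypothetical transformed embeddings (p = 3 residues, q = 2 residues, d = 2)
MX = [[0.5, -1.0], [1.0, 2.0], [0.0, 0.3]]
MY = [[1.0, 1.0], [-2.0, 0.5]]
GX = [[0.2, 0.1], [-0.4, 0.9], [1.5, -0.3]]
GY = [[0.3, -0.7], [0.8, 0.8]]

mu = score_matrix(MX, MY, softplus)    # match scores, all positive
g = score_matrix(GX, GY, logsigmoid)   # gap scores, all negative
```

Because softplus and logsigmoid have opposite signs, every match is rewarded and every gap is penalized regardless of what the LSTM layers output.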

Differentiable Dynamic Programming
Our proposed differentiable dynamic programming framework does not learn any parameters; it is designed purely to enable backpropagation to fine-tune the scoring functions M and G. Differentiable dynamic programming has been extensively explored in the context of dynamic time warping [36,24]. Koide et al [37] and Ofitserov et al [38] suggested that a differentiable Needleman-Wunsch alignment algorithm could be derived, but its implementation has remained elusive. Here, we provide the first GPU-accelerated implementation of the differentiable Needleman-Wunsch algorithm.
Previous work [24] has shown that backpropagation can be performed on dynamic programming algorithms by introducing smoothed maximum and argmax functions. Doing so enables the computation of derivatives while providing a tight approximation to the optimal dynamic programming solution. The traditional Needleman-Wunsch algorithm can be defined with the following recursion:

$$v_{i,j} = \mu_{i,j} + \max\left(v_{i-1,j-1},\; g_{i,j} + v_{i-1,j},\; g_{i,j} + v_{i,j-1}\right) \tag{1}$$

where the alignment score $v_{i,j}$ is evaluated on position i in the first sequence X and on position j in the second sequence Y. Sequences X and Y are of lengths n and m respectively. $\mu_{i,j}$ represents the log-odds score of residues $X_i$ and $Y_j$ being aligned and $g_{i,j}$ represents the log-odds score of an insertion or a deletion at positions i and j. Due to the structure of dynamic programming problems, $v_{n,m}$ is guaranteed to be the optimal alignment score between the two sequences. Furthermore, the optimal alignment can be obtained by tracing the highest-scoring path through the alignment matrix via argmax operations.
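A minimal Python sketch of this hard (non-differentiable) recursion and its argmax traceback follows. It assumes the recursion v[i][j] = µ[i][j] + max(v[i-1][j-1], g[i][j] + v[i-1][j], g[i][j] + v[i][j-1]), which matches the definition of θ in Appendix A, and it assumes boundary conditions in which every alignment starts at cell (1, 1); the function name and the toy scores are illustrative.

```python
NEG_INF = float("-inf")

def needleman_wunsch(mu, g):
    """Return (optimal score v[n][m], path of 1-indexed cells) for position-specific scores."""
    n, m = len(mu), len(mu[0])
    # Assumed boundary: only v[0][0] is reachable, so every path begins at cell (1, 1)
    v = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    v[0][0] = 0.0
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g_ij = g[i - 1][j - 1]
            candidates = [
                (v[i - 1][j - 1], (i - 1, j - 1)),    # match
                (g_ij + v[i - 1][j], (i - 1, j)),     # gap move from the row above
                (g_ij + v[i][j - 1], (i, j - 1)),     # gap move from the column to the left
            ]
            best, prev = max(candidates, key=lambda c: c[0])
            v[i][j] = mu[i - 1][j - 1] + best
            ptr[i][j] = prev
    # Traceback: follow the stored argmax pointers from (n, m) back to (0, 0)
    path, cell = [], (n, m)
    while cell != (0, 0):
        path.append(cell)
        cell = ptr[cell[0]][cell[1]]
    path.reverse()
    return v[n][m], path

# Toy scores: strong diagonal matches, uniform gap penalty
mu = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0]]
g = [[-1.0] * 3 for _ in range(3)]
score, path = needleman_wunsch(mu, g)
```

With these toy scores the traceback follows the diagonal, since any gap move trades a match score of 2.0 for 0.1 plus a penalty.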
As neither the max nor the argmax operations are differentiable, the alignment scores and the traceback cannot be differentiated in the traditional formulation of the traceback operations needed to generate alignments. Accordingly, Mensch et al [24] introduced smoothed differentiable operators, where the smoothed max operator $\max_\Omega(x)$ is given by the log-sum-exp function and the smoothed $\operatorname{argmax}_\Omega(x)$ is given by the softmax function. Since the softmax function can be derived from the derivative of $\max_\Omega$, the traceback matrix can also be obtained by differentiating the resulting alignment matrix. The resulting traceback matrix will yield the expected alignment between the two proteins.
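The relationship between the two smoothed operators can be checked numerically. The sketch below, in standard-library Python with illustrative function names, implements log-sum-exp and softmax and compares the softmax output against a finite-difference gradient of log-sum-exp.

```python
import math

def smooth_max(xs):
    # Log-sum-exp, shifted by the largest entry for numerical stability
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))

def smooth_argmax(xs):
    # Softmax: the gradient of log-sum-exp with respect to xs
    top = max(xs)
    weights = [math.exp(x - top) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]

xs = [1.0, 2.0, 0.5]
analytic = smooth_argmax(xs)

# Finite-difference gradient of smooth_max, one coordinate at a time
eps = 1e-6
numeric = []
for k in range(len(xs)):
    bumped = list(xs)
    bumped[k] += eps
    numeric.append((smooth_max(bumped) - smooth_max(xs)) / eps)
```

The softmax weights sum to one, which is why the smoothed traceback can be read as a probability distribution over the three possible predecessor moves.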
Since the loss function is defined as the difference between the predicted traceback matrix and the ground truth traceback matrix, the derivatives of the traceback matrix also need to be defined, which requires both the computations of the directional derivatives and the local Hessians of the alignment matrix (Appendix A).
In practice, dynamic programming can be the major computational bottleneck due to GPU data transfer and the quadratic runtime of the Needleman-Wunsch algorithm. To address this, we have implemented a GPU-accelerated differentiable Needleman-Wunsch algorithm inspired by Manavski et al [39]. As can be seen from the benchmarks shown in Figure 5, this algorithm is an order of magnitude faster than the naive CPU-bound Needleman-Wunsch implementation. Furthermore, this algorithm can enable batching, allowing for multiple alignments to be processed in parallel. As shown in Figure 5, larger batch sizes can further improve the scaling over CPU-bound alignments.
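One common way to expose parallelism inside a single alignment is to process the dynamic programming matrix by anti-diagonals: a cell (i, j) depends only on cells with a smaller i + j, so all cells that share the same i + j can be updated at once. The Python sketch below only illustrates that ordering; whether the DeepBLAST GPU kernel uses exactly this schedule is an assumption, and the function name is hypothetical.

```python
def antidiagonals(n, m):
    """Yield lists of 1-indexed cells (i, j) that share the same i + j."""
    for s in range(2, n + m + 1):
        lo = max(1, s - m)   # smallest row index on this anti-diagonal
        hi = min(n, s - 1)   # largest row index on this anti-diagonal
        yield [(i, s - i) for i in range(lo, hi + 1)]

# For a 3 x 4 matrix there are n + m - 1 = 6 wavefronts
waves = list(antidiagonals(3, 4))
for wave in waves:
    # On a GPU, each cell in `wave` could be assigned to its own thread
    pass
```

Under this schedule the number of sequential steps grows as n + m rather than n × m, and batching several alignments adds a further independent axis of parallelism.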

Alignment Loss Function
By defining a loss function between the predicted alignment and the structural alignment from TM-align, we can evaluate the accuracy of DeepBLAST and fine-tune the functions M and G. Mensch et al [24] proposed using the Euclidean distance between the predicted and ground truth alignments as a loss function. In practice, we found that a cross-entropy loss provided more reasonable alignment results. This loss is given by

$$L(e^*, e) = -\sum_{i,j} \left[ e^*_{i,j} \log e_{i,j} + (1 - e^*_{i,j}) \log\left(1 - e_{i,j}\right) \right] \tag{2}$$

where $e^*$ is the ground truth alignment and $e$ is the predicted alignment. As shown in [24], the predicted traceback matrix represents the expectation across all possible predicted alignments, which is represented as a matrix of probabilities. As a result, the resulting alignment problem can be interpreted as a classification task of identifying whether two residues between a pair of proteins are alignable. This provides additional motivation for utilizing cross-entropy as a loss function.
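As a sketch of this loss, the Python function below computes a binary cross-entropy between a 0/1 ground truth alignment matrix and a matrix of predicted alignment probabilities, summed over all residue pairs. The standard binary cross-entropy form, the clipping constant `eps`, and the toy matrices are assumptions made for illustration.

```python
import math

def alignment_cross_entropy(e_true, e_pred, eps=1e-12):
    """Binary cross-entropy summed over all cells of the alignment matrix."""
    total = 0.0
    for row_true, row_pred in zip(e_true, e_pred):
        for t, p in zip(row_true, row_pred):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

# Ground truth aligns residue 1 with 1 and residue 2 with 2
e_true = [[1.0, 0.0], [0.0, 1.0]]
good_pred = [[0.9, 0.1], [0.1, 0.9]]   # mostly agrees with the ground truth
bad_pred = [[0.1, 0.9], [0.9, 0.1]]    # mostly disagrees

good_loss = alignment_cross_entropy(e_true, good_pred)
bad_loss = alignment_cross_entropy(e_true, bad_pred)
```

Each cell is treated as an independent "aligned or not" classification, which mirrors the interpretation of the expected alignment as a matrix of probabilities.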

Training
We trained DeepBLAST on 1.5M alignments from the PDB [40] obtained using TM-align [41]. These proteins were obtained from a curated collection of 40k protein structures [42]. Details behind the model specification and training can be found in Appendix C.

Assessing alignment quality via held out analysis
Alignment accuracy was assessed on a held out test dataset of 79k structural alignments. To determine how well DeepBLAST generalizes, proteins that were in both the held out LSTM PFam dataset [11] and the held out TM-align alignments used to train DeepBLAST were analyzed. Within the DeepBLAST held out dataset, 57444 alignments were constructed from proteins that were unique to the DeepBLAST held out dataset, 1853 alignments contained proteins that were similar to those in the LSTM PFam training dataset and 19967 alignments contained a single protein that was unique to the DeepBLAST held out dataset and a single protein that was in the LSTM PFam training dataset. To evaluate the accuracy of the alignments, precision and recall were computed from the number of correctly identified matching residues. Since each alignment can be represented as a bipartite graph where the edges represent matching residues between two proteins, precision and recall can be extracted from comparing the edge sets between the predicted alignment and the ground truth alignments. Figure 6 shows the distribution of correctly identified alignment edges.
Figure 2: Alignment between two validation proteins: An example of an exact alignment between two proteins with very significant structural/fold similarity used to derive the ground truth alignment that has little sequence similarity.
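The edge-set comparison described above can be sketched in a few lines of Python. Each alignment is represented as a set of (i, j) pairs of matched residue positions; the function name and the toy edge sets are illustrative, and the F1 score is included because it is the summary metric reported for the benchmarks.

```python
def alignment_precision_recall(predicted_edges, truth_edges):
    """Compare two alignments given as collections of (i, j) matched-residue pairs."""
    predicted, truth = set(predicted_edges), set(truth_edges)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if true_positives else 0.0
    return precision, recall, f1

# Toy example: three of the four predicted edges agree with the ground truth
truth = [(1, 1), (2, 2), (3, 3), (4, 5)]
predicted = [(1, 1), (2, 2), (3, 4), (4, 5)]
precision, recall, f1 = alignment_precision_recall(predicted, truth)
```

Note that an edge only counts as correct when both residue positions match exactly, so a prediction shifted by a single residue scores zero on that edge.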
Within the DeepBLAST held out dataset, the true positive distribution of proteins held out from the training roughly resembles the true positive distribution of proteins observed in pre-training. The average true positive rate, false positive rate and false discovery rate are shown in Table 2.
As expected, DeepBLAST performs best with the TM-align structural alignments on sequences that have been used for training the LSTM language model. This is observed in the true positive rate in addition to the false positive and false negative rates, as shown in Table 2. Thus, it appears that the generalization of DeepBLAST primarily hinges on the underlying language model, as suggested by Rao et al [31].

Manually Curated Structural Alignment Benchmarks
We benchmarked DeepBLAST against three sequence alignment methods, Needleman-Wunsch, BLAST and HMMER, in addition to four structural alignment methods that work directly with the atomic coordinates, namely FAST, TM-align, Dali and Mammoth-local. TM-align emphasizes achieving the simultaneous maximal 3D spatial overlap of the atoms in each protein. Conversely, the local structure alignment (Mammoth-local) scores feasible residue pairings between the proteins according to the structural similarity of just seven-contiguous-neighbor windows, apropos to a remote homology philosophy where the full-length structure is allowed to be flexible and so does not require all the aligned atoms to overlap simultaneously after a rigid body orientation. Dali utilizes a distance matrix computed from hexapeptide contacts to align the two protein structures. FAST tries to preserve similar residue-residue contact patterns. We extracted the local structure alignment from the first phase of the Mammoth algorithm.
Thus from emphasizing long-range overlap, contacts, and local-window similarity, these reference algorithms span the rational disagreement across different expert opinions for the most meaningful structure alignment considering only backbone atomic coordinates (C-alpha or C-beta atoms). All of those algorithms disregard sequence similarity. Our method uses sequence alone; we do not supply the atomic coordinates of either protein to the algorithm after training it.
To form a common reference for all nine definitions of the optimal alignment, we designated a gold standard to be the manually curated structural alignments. Manual structure alignment is intuitive human assessment typically emphasizing 3D overlap and topology preservation since those features are easier to visualize than a plethora of local alignments and contacts [43,44,45].
Table 1: Malisam and Malidup Benchmarks. Sequence and structure alignment methods measured by their F1 score. Fast, TM-align, Dali and Mammoth-local are structure-structure alignment methods and provide a structure-informed upper bound for this benchmark, as many of the most challenging alignments in this benchmark are ultimately structure-derived or curated with a structure-structure alignment as an oracle.
All methods tend to agree when the problem is trivial due to near sequence identity and thus near structural identity. Therefore the most valuable gold-standard is where the dataset members have low sequence identity as well as varied degrees of structural similarity. In that regime, human intuition can provide an informative baseline by accessing additional evolutionary knowledge. Our benchmarks were performed on the curated Malisam [46] and Malidup [47] protein structural alignment benchmarking datasets.
As shown in Table 1, we observe that DeepBLAST outperforms all of the sequence alignment models by a large margin. This is observed in terms of both precision and recall as shown in Figure 8. In both benchmarks, the sequence similarity between proteins was below the observed detection limit for both BLAST and HMMER. As a result, these tools were not able to detect the vast majority of the alignments. This leaves Needleman-Wunsch as the baseline for sequence alignment methods.
However, there is no one definition of what the best structural alignment is [25,48]. This task becomes increasingly ambiguous as the remoteness of the homolog increases and the number of homologous residues declines. Thus two sequence alignments might slightly disagree but still be equally good in terms of structural superposition. Thus the above F1 score is indicative of alignment accuracy but is rigid since it only scores sequence alignments exactly matching the reference.
A better measure than the true positive rate is to directly measure the degree of structural overlap of two proteins given a specified alignment. The TM-score is one calibrated measure of this that factors out the dependence of the number of partially superimposed residues on the length of the protein. Figure 3 displays the TM-scores for multiple accepted criteria for superimposing 3D structures (Dali, TM-align, Fast). Each is plotted against the TM-score of a human curated manual superposition. The scatter in these points represents the reasonable disagreement among these varied structural similarity criteria, since all are arguably good methods. One can even observe that TM-align and Dali actually achieve slightly higher TM-scores than the supposedly ideal manual curation, highlighting the uncertainty in the best structural superposition. It can be seen that all of the structure-aware methods agree at high structural similarity, TM-score = 1 being perfect superposition of all atoms, but disagree increasingly as the TM-score declines.
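For orientation, the TM-score of a fixed superposition can be computed from the distances between aligned residue pairs. The Python sketch below uses the standard length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8; the 0.5 Å floor for short proteins is an assumption, and the sketch scores one given superposition rather than searching for the superposition that maximizes the score, as dedicated tools do.

```python
def tm_d0(length):
    """Length-dependent distance scale (in Angstroms) used by the TM-score."""
    if length > 15:
        d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5  # assumed floor for very short proteins
    return max(d0, 0.5)

def tm_score(distances, target_length):
    """TM-score of one superposition, given distances between aligned residue pairs."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# Perfect superposition of all 50 residues of a 50-residue protein
perfect = tm_score([0.0] * 50, 50)
# Only half of the residues aligned, each perfectly superimposed
half = tm_score([0.0] * 25, 50)
```

Dividing by the target length rather than the number of aligned pairs is what keeps short, partial alignments from scoring as well as full-length ones.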
To determine the agreement between sequence alignment methods and direct alignment by known structure, the TM-score was calculated for the predicted alignment. Both sequence alignment methods underperform the structure-aware methods in terms of their TM-scores. However, it is apparent that DeepBLAST is nearly always able to superimpose structures better than Needleman-Wunsch. We also compared these to the scores generated by using just local secondary structure to perform the alignment. This is essentially a Needleman-Wunsch alignment using local structural similarity to determine the quality of the alignment based on the structures. As shown in Figure 3, DeepBLAST is competitive with Mammoth-local while consistently outperforming Needleman-Wunsch. We suspect that a large part of the disagreement between DeepBLAST and the structural alignment methods could be explained by the different strategies towards handling gaps. DeepBLAST does not have a mechanism that handles affine gaps, which would be expected to cause it to over-align sequence instead of allowing large gaps for insertions or deletions. Figure 7 showcases a DeepBLAST predicted alignment with an affine gap; the long gaps are over-penalized and DeepBLAST is forced to insert intermediate matches in order to obtain an optimal alignment. This is more apparent from the TM-scores and RMS values highlighted in Figure 9; these metrics suggest that DeepBLAST currently has difficulties obtaining accurate global structural alignments, which may be partially attributed to the inability to handle affine gaps. Handling affine alignments in a differentiable dynamic programming framework is currently an outstanding problem, and we discuss potential approaches to adding affine gaps, as an area for future work, below. In spite of this shortcoming, there is evidence that DeepBLAST is able to learn structural information from the sequence. From the PSI scores shown in Figure 9, the high confidence alignments predicted by DeepBLAST are largely in agreement with the manually curated structural alignments. Furthermore, the sequence identity scores in Figure 9 reveal that DeepBLAST is able to obtain structural alignments that have less than 25% sequence identity, a known barrier for sequence alignment methods that can be overcome when the protein structures are known. All together, these metrics suggest that DeepBLAST can perform local structure alignment.
Figure 3: Visualization of Malisam and Malidup structural alignments. Each point represents an alignment; crosses are sequence-based alignments and circles are structure-based alignments. The estimated alignment is then superimposed on the structures to estimate the TM-score (Y-axis). The more to the right on the plot the more significant the structural overlap. As a reference, these are plotted against the Manual Alignment TM-score on the X-axis. The orange line derives alignment from measured secondary structure alone, while other structural alignment methods optimize a 3D structural superposition.

Conclusion
The major finding of this work is that language model embeddings capture much more of the structural basis for alignment than a purely sequence based alignment when used in an end to end differentiable framework with a structure-alignment based loss. This shows that the objective function being optimized by our method is strongly correlated to the desired objective of capturing the structural basis for alignment that is implicit in the sequence. Our findings show that our proposed method can generalize well on sequences observed by the protein language model and can align sequences where there is little local sequence similarity.
As mentioned above, the spread in the true positive rate for the four gold standard structure-based alignments should not be interpreted as performance differences across structural alignment methods. Instead, it reveals the widespread disagreement between experts regarding the ground truth structural alignment. While sequence-only DeepBLAST does not perform well compared to these explicit structural alignment methods, the resulting DeepBLAST alignments do agree more with the structural alignment methods than the other sequence-only methods.
One major difference between Needleman-Wunsch and our proposed DeepBLAST algorithm is that we are not mainly weighting the alignment according to evolutionary distance. The embeddings learned from DeepBLAST are able to capture position-specific structural hints in the sequence. Given that protein secondary structure and tertiary contacts can be predicted from sequence alone [49,50,51], this is not surprising. Furthermore, DeepBLAST can be more readily scalable to large proteins, potentially enabling structural similarity search where ab initio models cannot be built. This is key for enabling more accurate sequence search, since the vast majority of protein sequences do not have known protein structures.
As it stands, DeepBLAST is already a better alternative to traditional sequence similarity alignment, and is usable as such. Moreover, having validated the hypothesis that the language-model unsupervised training embeds structural attributes, we can refine and adapt this signal for future applications, such as function prediction and protein structure prediction. Conveniently, the end-to-end differentiable design facilitates retraining for each new objective.
The method seamlessly and continuously bridges both the sequence clues and the structural inferences. What this paper establishes is that this combined sequence/structure feature space exists, can be learned from language models, and is robust on held-out data. Our confidence in this conclusion is high because this signal was found even in gold standards chosen for resistance to sequence alignment.

Code Availability
Our software and analyses can be found on Zenodo at 10.5281/zenodo.4117030.

Appendices

A Differentiable Needleman-Wunsch Algorithm
Recall from Equation 1 that the recursion behind Needleman-Wunsch is given by

$$v_{i,j} = \mu_{i,j} + \max\left(v_{i-1,j-1},\; g_{i,j} + v_{i-1,j},\; g_{i,j} + v_{i,j-1}\right)$$

Replacing max with the smoothed operator $\max_\Omega$ makes $v_{i,j}$ differentiable, and the gradient $\omega_{i,j}$ of $\max_\Omega$ with respect to its three arguments is given by the softmax function. This derivative formulation will also enable backpropagation to downstream parameters. Letting $\theta_{i,j} = (\mu_{i,j},\; \mu_{i,j} + g_{i,j},\; \mu_{i,j} + g_{i,j})$, $\omega_{i,j}$ can also be given by $\omega_{i,j} = \frac{\partial v_{i,j}}{\partial \theta_{i,j}}$. The computation of the alignment scores and traceback matrices for the differentiable Needleman-Wunsch algorithm is given in Algorithm 1. Insertion states (x), match states (m) and deletion states (y) have been color coded red, green and blue respectively. As proposed by Mensch et al, $e_{i,j}$ can be interpreted as elements in an expected alignment. With this in mind, the expected alignment can be compared to the ground truth alignment to estimate the loss as highlighted in Equation 2. Performing backpropagation on this loss function requires the computation of local Hessians on $v_{n,m}$. We will refrain from providing the derivation of these Hessians, but the derivatives of $v_{i,j}$, $\omega_{i,j}$ and $e_{i,j}$ are denoted by $\dot{v}_{i,j}$, $\dot{\omega}_{i,j}$ and $\dot{e}_{i,j}$, and can be computed as shown in Algorithm 2. These Hessians are calculated from directional derivatives that depend on Z, the gradient of the binary cross entropy loss with respect to the variables µ and g.
It is important to note that all of the differentiable dynamic programming is only required for training. When performing prediction, the standard Needleman-Wunsch algorithm can run on the learned scoring matrices µ and g.
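The forward pass and expected alignment described in this appendix can be sketched in standard-library Python. This is not a transcription of Algorithm 1: it follows the general forward/backward scheme of Mensch et al [24] under the assumptions that max is replaced by log-sum-exp, that the boundary forces every path to start at cell (1, 1), and that the expected alignment entry e[i][j] equals the derivative of the final score with respect to µ[i][j]. The function names are illustrative.

```python
import math

NEG_INF = float("-inf")

def _lse_and_softmax(xs):
    # Returns (log-sum-exp of xs, softmax of xs); at least one entry must be finite
    top = max(xs)
    weights = [math.exp(x - top) for x in xs]
    total = sum(weights)
    return top + math.log(total), [w / total for w in weights]

def soft_needleman_wunsch(mu, g):
    """Return (smoothed score v[n][m], expected alignment e as an n x m matrix)."""
    n, m = len(mu), len(mu[0])
    # One extra row and column of padding simplifies the backward pass
    v = [[NEG_INF] * (m + 2) for _ in range(n + 2)]
    v[0][0] = 0.0
    w = [[(0.0, 0.0, 0.0)] * (m + 2) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g_ij = g[i - 1][j - 1]
            # Arguments ordered as (diagonal, from above, from left)
            lse, w[i][j] = _lse_and_softmax(
                [v[i - 1][j - 1], g_ij + v[i - 1][j], g_ij + v[i][j - 1]])
            v[i][j] = mu[i - 1][j - 1] + lse
    # Backward pass: e[i][j] accumulates the probability of passing through (i, j)
    e = [[0.0] * (m + 2) for _ in range(n + 2)]
    e[n][m] = 1.0
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            if (i, j) == (n, m):
                continue
            e[i][j] = (w[i + 1][j + 1][0] * e[i + 1][j + 1]   # successor reached diagonally
                       + w[i + 1][j][1] * e[i + 1][j]         # successor directly below
                       + w[i][j + 1][2] * e[i][j + 1])        # successor directly to the right
    return v[n][m], [row[1:m + 1] for row in e[1:n + 1]]

mu = [[1.0, 0.2, 0.1], [0.3, 1.5, 0.2]]
g = [[-0.5] * 3 for _ in range(2)]
score, expected = soft_needleman_wunsch(mu, g)
```

Because the first and last cells lie on every admissible path under these boundary conditions, their expected alignment entries equal one, while interior entries fall between zero and one.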

C Additional Training Details and Held out Analysis
The final DeepBLAST model consisted of 4 LSTM layers of dimension 512 to parameterize the match embeddings M and gap embeddings G. A 2-layer bidirectional LSTM protein language model pretrained by [11] was used as a precursor step for estimating residue vectors. The resulting model had a total of 100M parameters. We used the Adam optimizer to train the weights with an initial learning rate of $5 \times 10^{-5}$, and the pretrained LSTM model weights were frozen. A batch size of 160 alignments was used for training. DeepBLAST was trained for 10 epochs on 4 Nvidia V100 GPUs for 4 days.
The training dataset consisted of proteins from the PDB [7]. Only proteins that had less than 1000 residues and alignments with a TM-score greater than 0.4 were considered. Furthermore, since only global alignments can be handled, the gaps at the ends of the alignment were trimmed before training. The data was split into 80/10/10 train/validation/test splits.
To evaluate how well DeepBLAST can generalize across unobserved data, the alignment accuracy was evaluated on the DeepBLAST held out testing dataset. The intersection between the held out PFam test sequences used to train the LSTM and the held out TM-align alignments was determined by performing a pairwise alignment between the two held out datasets with BLAST. Alignments in which both sequences were detected to be homologous to sequences in the held out PFam dataset according to BLAST are labeled as "Both train". Alignments in which only one sequence was found in the held out PFam dataset are labeled as "One test, one train". Alignments in which neither sequence was found in the PFam dataset are labeled as "Both test". As shown in Table 2, DeepBLAST generalizes best on sequences observed in the LSTM pretraining procedure.
The TM-score measures the superposition agreement between the two aligned protein structures. The oPSI metric measures the fraction of aligned residues relative to the smaller protein on the aligned residues predicted to be strongly superimposed by the alignment method. The oRMS metric measures the root mean square deviation of the atomic positions on the aligned residues predicted to be strongly superimposed by the alignment method. The oSeq identity score measures the fraction of identical sequence measured over the subset of the sequence alignment that was also aligned structurally by the method. All of the alignment metrics are displayed in rank order, and the points represent the manual scores for that given protein, representing an upper or lower bound of the correct alignment.