TM-Vec: template modeling vectors for fast homology detection and alignment

Exploiting sequence-structure-function relationships in molecular biology and computational modeling relies on detecting proteins with high sequence similarities. However, the most commonly used sequence alignment-based methods, such as BLAST, frequently fail on proteins with low sequence similarity to previously annotated proteins. We developed a deep learning method, TM-Vec, that uses sequence alignments to learn structural features that can then be used to search for structure-structure similarities in large sequence databases. We train TM-Vec to accurately predict TM-scores as a metric of structural similarity for pairs of structures directly from sequence pairs, without the need for intermediate computation or solution of structures. For remote homologs (sequence similarity ≤ 10%) that are highly structurally similar (TM-score ≥ 0.6), we predict TM-scores within 0.026 of the value computed by TM-align. TM-Vec outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. TM-Vec was trained on the CATH and SwissModel structural databases, and it has been tested on carefully curated structure-structure alignment databases that were designed specifically to test very remote homology detection methods. It scales sub-linearly for search against large protein databases and is well suited for discovering remotely homologous proteins.

TM-Vec accurately predicts TM-scores for held-out pairs (r=0.936, p < 1 × 10⁻⁵, median error=0.023) as well as held-out domains (r=0.901, p < 1 × 10⁻⁵, median error=0.023) (Figure 2B, Figure S4B). TM-Vec's prediction errors are highest for pairs with TM-scores in the [0.75-1.0] range, and its accuracy declines on held-out folds. However, the incremental increase in the generalization error for proteins in the held-out folds (r=0.781, p < 1 × 10⁻⁵, median error=0.042) showcases that TM-Vec is robust to out-of-distribution observations, a critical requirement for extrapolating beyond the experimental structures within the PDB (Figure 2B, Figure S4B).

To further validate this finding, we applied TM-Vec to the Microbiome Immunity Project (MIP) [44], which contains 200,000 de novo protein structure predictions from previously unknown proteins, including putative novel folds. The correlation between our predictions and the TM-scores for MIP protein pairs with putative novel folds (r=0.785, p < 1 × 10⁻⁵) was surprisingly close to the estimates we observed with the held-out CATH folds. Table S1 shows a confusion matrix for our TM-score predictions, where we observe that TM-Vec had a 99.9% true positive rate for predicting whether a pair shared a novel fold (TM-score ≥ 0.5) and a false positive rate of 3.9%. Altogether, these validation benchmarks across SwissProt, CATH and MIP show that TM-Vec is well suited to detect similarities between proteins with previously unknown structures and folds, extending the general utility of this work.

Capturing structural information in the latent space

We visualize and benchmark the learned representations produced by TM-Vec against an array of alternative methods that depend on either sequence or structure alone. The results of our benchmarks show that TM-Vec distinguishes CATH categories more accurately than these alternatives (Macro AUPR = 0.79).
The performance gap between the pretrained ProtTrans model (Macro AUPR = 0.66) and the fine-tuned model obtained with TM-Vec highlights the importance of fine-tuning with a structure-based objective. Furthermore, the fact that TM-Vec outperforms sequence-based representations on the CATH dataset that is clustered at 40% sequence similarity provides evidence that TM-Vec learns quality structural features rather than a trivial feature of the underlying data or a function of sequence similarity.

We next evaluated alignment quality on the Malidup and Malisam benchmarks; each protein pair in these benchmarks has a significant structurally similar region, a manually curated structure-structure alignment, and low sequence similarity that is either below or at the threshold of detection by sequence alignment tools. One of the challenges of benchmarking structural alignment methods is defining the ground-truth structural alignment. As shown in Figure 3A, there are subtle disagreements between the manual alignments and the structural alignment methods, highlighting the uncertainty in defining the optimal structural alignment. This is highlighted in scenarios where TM-align obtains a better structural superposition than the manual alignment (TM-align superimposes more atoms, or a greater extent of backbone regions, than the manual alignment). All of the structure-aware methods agree at high structural similarity (TM-score = 1 being a perfect superposition of all atoms), but increasingly disagree as the TM-score declines. We observe that TM-Vec is directly comparable to structure-aware methods, and the confidence bands for its trend line overlap with the trend lines for TM-align and Fast (Figure 3A). Although the trend lines overlap, TM-Vec's prediction errors have a higher variance than those of the structure-aware methods. To determine the agreement between sequence alignment methods and structural alignment methods, the TM-score was calculated for each predicted alignment.
While DeepBLAST does not always generalize for divergent proteins, to illustrate an example where our method does obtain correct alignments for highly divergent proteins, we focused on two duplicated Annexin domains with a sequence identity of 24.7%. DeepBLAST accurately aligned these proteins (TM-score=0.75), and 4 of the 5 folds that were superimposed were in agreement with the manual alignment (Figure 3B-D). In contrast, Needleman-Wunsch was not able to identify any structural similarity between these two proteins (TM-score=0.33). The differences between Needleman-Wunsch and DeepBLAST are further illustrated in Figure S5: the high-confidence alignments predicted by DeepBLAST are largely in agreement with the manually curated structural alignments. Furthermore, the sequence identity scores in Figure S5 reveal that DeepBLAST is able to obtain structural alignments for pairs that have ≤ 25% sequence identity, a known barrier for sequence alignment methods that can be resolved when the protein structures are known. Altogether, these metrics suggest that DeepBLAST can perform local structural alignment.

Our runtime benchmarks show that TM-Vec is able to encode 50K queries on one GPU within 40 minutes (Figure S6A). In Figure S6B, we show sub-linear search performance. The search runtime benchmarks for different database sizes show that 50K queries against a database of 5M proteins can be performed within 20 seconds on a single GPU, showcasing that encoding sequences is the computational bottleneck in search. To contrast the TM-Vec query search time with existing sequence-based methods, we compared the TM-Vec query runtimes to Diamond [45] and BLAST. We observe that TM-Vec is not as fast as Diamond, which is optimized for short reads and is known to have remote homology sensitivity and alignment performance similar to BLAST.
TM-Vec does outperform BLAST in all cases, including in the modes adapted for scaling TM-Vec described here, and its performance scales sub-linearly with database size (Figure S6C). For example, TM-Vec achieves a 10X speedup over BLAST when performing 1,000 queries on a database of 100,000 proteins, and this speedup grows as the database size increases: on a 1M-protein database there is a 100X speedup. Thus, TM-Vec can be used to carry out full-repository searches and large all-vs-all queries, and can do so with vastly improved remote homology detection and sensitivity. Further gains in computational performance are likely achievable (this work focuses on accuracy and sensitivity with respect to the quality of structure-structure alignments).

Once structurally similar proteins have been identified, structural alignments via DeepBLAST can identify structurally homologous regions. Our structural alignment runtime benchmarks show that, unlike the Needleman-Wunsch CPU implementations, the structural alignment runtime of our differentiable Needleman-Wunsch GPU implementation does not increase linearly with respect to the batch size, showcasing how our method can process multiple alignments in parallel on a single GPU (Figure S6D). Furthermore, we observe that both the CPU and GPU implementations scale linearly with the lengths of both proteins, with our GPU implementation consistently yielding a 10X speedup over the CPU implementation.

We conducted an analysis of a structurally diverse set of protein families, the bacteriocins, using the BAGEL database [46]. Bacteriocins are peptide-derived molecules produced by bacteria that often serve as antimicrobial peptides targeting competing microbial species. They can also be involved in cell-cell communication.

Several bacterial species encode bacteriocins, and in light of their strong ecological benefits, bacteria are under evolutionary pressure to obfuscate these genes. As a result, bacteriocins showcase substantial sequence and structural diversity and are notoriously difficult to detect using sequence homology tools [47]. To date, fewer than 1,000 bacteriocins have been identified and classified, despite there being trillions of microbial species [48] with the potential to produce antimicrobial peptides.

Previous studies have shown that bacteriocin structures can be characterized by their highly modified polypeptides, suggesting structural cues to identify novel bacteriocins where sequence-similarity approaches fail. Our analysis revealed that TM-Vec can clearly partition bacteriocins according to their BAGEL annotations (Figure 4A, Figure 4B). Interestingly, unannotated bacteriocins identified in Morton et al. [47] were found to be structurally similar to Lanthipeptides A and B. In Figure 4C, we compared AlphaFold2 with TM-Vec on this bacteriocin dataset and found that TM-Vec distinguishes bacteriocin classes more accurately than ColabFold in combination with TM-align. Lastly, Figure 4D shows a DeepBLAST alignment for the 3 nearest classified bacteriocin neighbors of a putative bacteriocin identified by Hamid et al. [49].

In Figure S7A, we found that non-toxins clearly separated from the different bacteriocin class clusters based on their structural similarity. We further tested our ability to distinguish bacteriocins by training a k-nearest neighbor classifier for bacteriocin classes and non-toxins; the overall precision and recall of these classifiers were 98% and 93%, respectively (Figure S7B).
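As a concrete illustration of this classification setup, the following is a minimal sketch of a 3-nearest-neighbor classifier over embedding vectors. The synthetic two-dimensional clusters, label names, and 85/15 split here are illustrative stand-ins only; they are not the BAGEL data or the real TM-Vec embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated synthetic "embedding" clusters stand in for
# TM-Vec vectors of bacteriocins (label 0) vs. non-toxins (label 1).
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

idx = rng.permutation(100)
train, test = idx[:85], idx[85:]          # 85/15 split, as in Figure S7B

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote over the k nearest training points (Euclidean)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        votes = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(votes).argmax())
    return np.array(preds)

pred = knn_predict(X[train], y[train], X[test])
accuracy = (pred == y[test]).mean()
```

Because the toy clusters are well separated, this sketch classifies the held-out points perfectly; the reported 98%/93% precision and recall refer to the real, harder BAGEL task.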

We have shown that TM-Vec has the potential to close the outstanding gap between protein sequence and structural information by enabling structural alignments from sequence information and remote homology search on repository-scale protein sequence databases. From our benchmarks on SwissProt and CATH, TM-Vec can accurately predict the TM-score to quantify structural similarities across widespread structural diversity, including remote homologs that fall below the 10% sequence identity threshold. Compared to sequence-based and structure-based methods, TM-Vec can competitively differentiate tiers of the CATH hierarchy, despite not being explicitly trained to classify CATH classes. Furthermore, TM-Vec is able to predict structural similarity close to existing structural similarity methods, while being able to query structures directly from sequence. On the structural alignment benchmark, while we observe that there are certain remote homologs that our aligner misses, we consistently outperform sequence alignment methods. When we applied TM-Vec to the BAGEL database, we were able to accurately cluster bacteriocins based on both their class and subclass labels, a task that AlphaFold2 struggles with. We were also able to confidently annotate 28 putative bacteriocins by finding their nearest class or subclass clusters. Our insights hint at the potential to lower the barrier for natural product discovery.

In light of the promising advantages TM-Vec provides over existing methodologies, there are a few limitations to consider. TM-Vec is not well suited to detect structural differences induced by point mutations.

From a benchmark using the VIPUR dataset, TM-Vec was unable to detect structural differences caused by point mutations.

We first describe our model, which produces structure-aware embeddings of protein sequences. We then use this model to build large searchable databases of protein representations that can be queried to find proteins with similar structures, using only their sequence information. In the last piece of our pipeline, we produce protein structure alignments, using sequences alone, for the proteins that are predicted to have the most similar structures.

The TM-Vec model is trained on pairs of protein sequences and their TM-scores (the measure of protein structure similarity we use), and leverages the latest advances in deep protein language models. When protein sequences are fed into the pipeline, a pretrained deep protein language model, ProtTrans ('ProtT5-XL-UniRef50'), is used to produce embeddings for every residue in the protein [32]. These residue embeddings are then fed into a twin neural network that we train, called ϕ. Figure S1 shows the function ϕ, which takes residue embeddings and produces a flattened vector representation for each protein. ϕ is composed of several transformer encoder layers, followed by average pooling, dropout, and fully connected layers. Finally, we calculate the cosine similarity between the reduced representations of each protein in the pair, and our training objective is to minimize the L1 distance between the cosine similarity of the pair's reduced representations and their TM-score. Therefore, for any pair of protein sequences, a forward pass of our model can predict the pair's TM-score, and can also be used to produce structure-aware embeddings for each protein sequence.

In order to build a large database of structure-aware protein embeddings, we start with large databases of protein sequences. Among the databases of protein sequences that we encode are SwissProt and CATH.
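The forward pass and training objective described above can be sketched as follows. The pooling function here is a toy stand-in for the trained network ϕ (the real ϕ also applies transformer encoder layers, dropout, and fully connected layers before pooling), and the sequence lengths and embedding dimension are illustrative.

```python
import numpy as np

def phi(residue_embeddings):
    """Toy stand-in for the trained twin network phi: average pooling
    over residues to get one flattened vector per protein. The real phi
    applies transformer encoder layers, dropout, and fully connected
    layers as well."""
    return residue_embeddings.mean(axis=0)

def predicted_tm_score(H_x, H_y):
    """Cosine similarity of the pooled representations: the model's
    TM-score prediction for the pair."""
    z_x, z_y = phi(H_x), phi(H_y)
    return float(z_x @ z_y / (np.linalg.norm(z_x) * np.linalg.norm(z_y)))

def l1_loss(pred, tm_score):
    # Training objective: |cos(z_x, z_y) - TM-score|
    return abs(pred - tm_score)

# Toy "residue embedding" matrices (length x dim) for a protein pair.
rng = np.random.default_rng(0)
H_x, H_y = rng.normal(size=(120, 8)), rng.normal(size=(95, 8))
pred = predicted_tm_score(H_x, H_y)
```

The same forward pass thus serves two purposes: the pooled vectors are the structure-aware embeddings used for search, and their cosine similarity is the TM-score prediction.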

After encoding each protein sequence, we build an indexed, vector-searchable database of protein embeddings using the FAISS package [34]. When this database is queried with a new sequence, we first embed the protein using a forward pass of the TM-Vec embedding model, and then return the nearest neighbors of the query according to cosine similarity (the proteins in our database with the highest predicted structural similarity, or TM-score). While one of our goals is to return the nearest neighbors in structure space for any query protein, another goal is to also include the structural alignments of the nearest neighbors with the query protein, using sequences alone.

Protein Language Modeling for alignment

In order to obtain an alignment from dynamic programming, the scoring parameters for matches and gaps must be obtained. Here we can use a number of pretrained protein language models to estimate these scoring parameters. These pretrained models ultimately construct a function mapping a sequence of residues, represented as one-hot encodings, to a set of residue vectors, providing an alternative representation of these proteins. Often these models learn representations by being trained to predict randomly masked residues within a protein sequence. Multiple studies have shown the merits of these models for protein structure prediction, remote homology detection, and protein design [29, 55, 28, 31, 30, 32, 33]. Here, we have used the pretrained LSTM PFam model from [27]. Using this pretrained language model, two proteins X and Y can be represented by embeddings H_X ∈ R^{p×d} and H_Y ∈ R^{q×d}, where p and q are the lengths of proteins X and Y and d is the embedding dimension of the language model.
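The indexed search described earlier reduces to maximum-inner-product search over unit-normalized embeddings, so that inner product equals cosine similarity. A minimal NumPy sketch of that idea follows; in the actual pipeline the normalized vectors would live in a FAISS index (e.g. an inner-product flat index) rather than a plain array, and the database size and dimension here are illustrative.

```python
import numpy as np

def build_index(embeddings):
    """Normalize embeddings so inner product equals cosine similarity.
    (A stand-in for inserting the vectors into a FAISS inner-product
    index; plain NumPy keeps this sketch self-contained.)"""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def search(index, query, k=3):
    """Return the k nearest neighbors of the query by cosine similarity."""
    q = query / np.linalg.norm(query)
    scores = index @ q                 # cosine similarity per database entry
    top = np.argsort(-scores)[:k]      # k highest predicted TM-scores
    return top, scores[top]

rng = np.random.default_rng(1)
db = build_index(rng.normal(size=(1000, 16)))     # toy embedding database
query = db[42] + rng.normal(scale=0.01, size=16)  # near-duplicate of entry 42
hits, scores = search(db, query)
```

Because cosine similarity is the model's TM-score prediction, the returned scores can be read directly as predicted structural similarities.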
Given these representations, we can construct mappings M and G that take the residue-level embeddings H_X and H_Y to the match scores and gap scores used by the differentiable dynamic programming. Previous work [58] and Ofitserov et al. [59] suggested that a differentiable Needleman-Wunsch algorithm could be used to learn alignments; here we provide the first GPU-accelerated implementation of the differentiable Needleman-Wunsch algorithm.
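As one plausible instantiation of the mappings M and G (the exact parameterization used by DeepBLAST is not reproduced here; a simple bilinear form via linear projections of the residue embeddings is assumed purely for illustration, with random weights standing in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 4   # language-model dim and score-embedding dim (illustrative)
H_X, H_Y = rng.normal(size=(10, d)), rng.normal(size=(12, d))

# M and G are learned mappings in the real model; random linear
# projections stand in for them in this sketch.
W_M, W_G = rng.normal(size=(d, h)), rng.normal(size=(d, h))

mu = (H_X @ W_M) @ (H_Y @ W_M).T   # match scores, one per residue pair (p x q)
g  = (H_X @ W_G) @ (H_Y @ W_G).T   # gap scores, one per residue pair (p x q)
```

Whatever their exact form, the key property is that both score matrices are differentiable functions of the embeddings, so gradients from the alignment loss can flow back into M and G.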
Previous work [57] has shown that backpropagation can be performed on dynamic programming algorithms by introducing smoothed maximum and argmax functions. Doing so enables the computation of derivatives while providing a tight approximation to the optimal dynamic programming solution. The traditional Needleman-Wunsch algorithm can be defined with the following recursion:

v_{i,j} = max( v_{i−1,j−1} + µ_{i,j}, v_{i−1,j} + g_{i,j}, v_{i,j−1} + g_{i,j} )

where the alignment score v_{i,j} is evaluated at position i in the first sequence X and position j in the second sequence Y. Sequences X and Y are of lengths n and m, respectively. µ_{i,j} represents the log-odds score of residues X_i and Y_j being aligned, and g_{i,j} represents the log-odds score of an insertion or a deletion at positions i and j. Due to the structure of dynamic programming problems, v_{n,m} is guaranteed to be the optimal alignment score between the two sequences. Furthermore, the optimal alignment can be obtained by tracing the highest-scoring path through the alignment matrix via argmax operations. As neither the max nor the argmax operation is differentiable, the alignment scores and the traceback cannot be differentiated in the traditional formulation of the traceback operations needed to generate alignments. Accordingly, Mensch et al. [57] introduced smoothed differentiable operators, where the smoothed max operator max_Ω(x) is given by the log-sum-exp function and the smoothed argmax_Ω(x) is given by the softmax function. Since the softmax function can be derived from the derivative of max_Ω, the traceback matrix can also be obtained by differentiating the resulting alignment matrix. The resulting traceback matrix will yield the expected alignment between the two proteins.
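The smoothed forward recursion can be sketched in a few lines of NumPy, with max_Ω realized as log-sum-exp. This is a didactic CPU reference for the recursion above, not the batched GPU-accelerated implementation used by DeepBLAST.

```python
import numpy as np

def logsumexp(a):
    """Numerically stable smoothed maximum max_Omega."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def smooth_nw(mu, g):
    """Smoothed Needleman-Wunsch forward pass over match scores mu and
    gap scores g (both n x m). Implements
        v[i,j] = max_Omega(v[i-1,j-1] + mu, v[i-1,j] + g, v[i,j-1] + g)
    and returns v[n,m], which upper-bounds the hard optimal score
    because logsumexp >= max."""
    n, m = mu.shape
    v = np.full((n + 1, m + 1), -np.inf)
    v[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i, j] = logsumexp(np.array([
                v[i - 1, j - 1] + mu[i - 1, j - 1],  # match
                v[i - 1, j] + g[i - 1, j - 1],       # insertion
                v[i, j - 1] + g[i - 1, j - 1],       # deletion
            ]))
    return v[n, m]
```

Because every operation here is differentiable, differentiating v[n,m] with respect to the match scores (e.g. via an autodiff framework) yields the expected traceback matrix described in the text.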

Alignment Loss Function

By defining a loss function between the predicted alignment and the structural alignment from TM-align, we can evaluate the accuracy of DeepBLAST and fine-tune the functions M and G. Mensch et al. [57] proposed using the Euclidean distance between the predicted and ground-truth alignments as a loss function. In practice, we found that a cross-entropy loss provided more reasonable alignment results. This loss is given by

L(e*, e) = − Σ_{i,j} [ e*_{i,j} log e_{i,j} + (1 − e*_{i,j}) log(1 − e_{i,j}) ]

where e* is the ground-truth alignment and e is the predicted alignment. As shown in [57], the predicted alignment is obtained from the derivative of the smoothed alignment score, making the loss end-to-end differentiable.

Protein chain pairs dataset

The model that we ultimately use to encode protein sequences is trained on pairs of protein chains. We sample pairs of chains from SwissModel, which contains over 500K chains. We filter out protein chains that are longer than 300 residues and are left with 277K chains. From these chains we make pairs, taking care to oversample pairs of proteins with similar folds, which we do by using information from Gene3D about the predicted domains within protein chains. The resulting dataset contains 150 million pairs of protein chains. We run TM-align for each one of these pairs of protein chains using their structures from the PDB, and record the TM-score and sequence identity for every pair.

Figure 1: (a) The TM-Vec pipeline consists of two stages: retrieval and alignment. First, TM-Vec takes a query protein sequence and rapidly retrieves proteins that are predicted to have similar structures (TM-scores) to the query. Then, DeepBLAST produces alignments for the proteins with the highest predicted structural similarity. (b) TM-Vec is trained on pairs of amino acid sequences and their TM-scores. We first input a pair of sequences (i.e. domains, chains, proteins), and use a pretrained deep protein language model to extract embeddings for every residue of the sequence.
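The cross-entropy comparison of alignments can be sketched as follows, under the assumption (made for illustration; the exact form in DeepBLAST may differ) that the ground-truth alignment is a 0/1 matrix and the predicted alignment is a matrix of per-cell probabilities:

```python
import numpy as np

def alignment_xent(e_true, e_pred, eps=1e-9):
    """Binary cross-entropy between a 0/1 ground-truth alignment matrix
    e_true and a predicted alignment matrix e_pred of per-cell
    probabilities, summed over all residue pairs. The eps clip guards
    against log(0)."""
    e_pred = np.clip(e_pred, eps, 1.0 - eps)
    return -np.sum(e_true * np.log(e_pred)
                   + (1.0 - e_true) * np.log(1.0 - e_pred))

e_star = np.eye(4)               # toy ground-truth diagonal alignment
perfect = alignment_xent(e_star, e_star)
uncertain = alignment_xent(e_star, np.full((4, 4), 0.5))
```

A perfect prediction drives the loss to (numerically) zero, while a maximally uncertain prediction incurs log 2 per cell, so minimizing this loss pushes the predicted traceback toward the curated structural alignment.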
Next, we apply a twin neural network, called ϕ, to the embeddings of each sequence, and produce a vector representation, z, for each sequence. The ϕ network is trained on millions of pairs of sequences and its architecture is detailed in Figure S1. Finally, we compute the cosine similarity of the vector representations, which is our prediction for the TM-score (structural similarity) of the pair. (c) We build a TM-Vec database by encoding large databases of protein sequences using a trained TM-Vec model. As an example, we input the sequences from SwissProt, extract vector representations for every sequence, and finally build an indexed database of TM-Vec's structure-aware vector representations of proteins. (d) Demonstration of protein structure search using the TM-Vec pipeline.
As the indexed database of vector representations has already been built, protein search consists of first encoding the query sequence using the trained TM-Vec model, and then performing fast vector search and TM-score prediction using cosine similarity as the search metric. As search results, we return the k-nearest neighbors with the highest predicted structural similarity (TM-score) to the query sequence. (e) As a last step, we apply DeepBLAST to produce structural alignments for the k-nearest neighbors of a query sequence.

Figure 3: a) Mammoth-local alignments are structure-structure alignments and provide a structure-informed upper bound for this benchmark, as many of the most challenging alignments in this benchmark are ultimately structure-derived or curated with a structure-structure alignment as an oracle, whereas Fast, Dali, and TM-align are structural alignment methods. The y-axis represents the predicted TM-score given a predicted alignment, and the x-axis represents the TM-score from a manually curated alignment. TM-Vec performs comparably to structural alignment methods and outperforms Needleman-Wunsch. b) A predicted alignment of two duplicated Annexin domains from Malidup that DeepBLAST aligns accurately (TM-score=0.75) and Needleman-Wunsch struggles with (TM-score=0.33). c) The manual alignment of the two duplicated Annexin domains with the agreement with DeepBLAST highlighted. d) Visualization of the manual structural alignment of the Malidup pair with the chains that DeepBLAST aligned correctly highlighted in yellow.

Table S1: Predicting TM-scores for novel folds discovered by the Microbiome Immunity Project (MIP). There was a Pearson correlation of 0.786 between TM-Vec's TM-score predictions and the known TM-scores for pairs of proteins from new, previously undiscovered folds identified by the MIP project. This correlation is nearly identical to the correlation of 0.78 that we obtained when predicting on a test dataset of held-out folds from CATH.
Table S2: TM-Vec search accuracy on CATH datasets across multiple tiers with varying sequence similarity. Across every tier of CATH, and for two CATH datasets, CATH S40 (clusters thresholded at 40% sequence similarity) and CATH S100 (thresholded at 100% sequence similarity), we show the classification accuracy of the nearest neighbors returned by TM-Vec. For example, the Top 1 accuracy for Topology in CATH S100 is 97.7%, which means that 97.7% of the time, the nearest neighbor returned by TM-Vec shares the query domain's fold. As another example, the Top 3 column for Topology indicates the percentage of the time that at least one of the top 3 nearest neighbors returned by TM-Vec shares the query domain's fold.

Figure 4: Annotating and aligning unclassified putative bacteriocins using TM-Vec. a) Visualizing TM-Vec embeddings using t-SNE for 3 classes of bacteriocins in addition to 28 unclassified putative bacteriocins. For 94% of the annotated bacteriocins, the nearest classified bacteriocin is in the same bacteriocin class. b) Visualization of Class 1 bacteriocins by subclass, highlighting how TM-Vec can recover multiple levels of manual annotation without protein structures. c) A comparison of TM-Vec's TM-score predictions with the TM-scores produced by running TM-align on ColabFold-predicted structures for every pair of bacteriocins. Using a TM-score of 0.5 as a structural similarity cutoff, TM-Vec is able to distinguish pairs of proteins that are in the same class versus different classes, and in the same subclass for Class 1 bacteriocins, while TM-align on ColabFold structures is not. d) DeepBLAST alignments for a putative bacteriocin, YP 006656667, and its 3 nearest neighbors in embedding space (i.e. those with the highest predicted TM-scores).

Figure S1: Overview of the TM-Vec neural network architecture. The function ϕ takes in residue embeddings and produces a flattened vector representation for each protein.
ϕ is composed of several transformer encoder layers, followed by average pooling, dropout, and fully connected layers. At the final step, we calculate the cosine similarity between the vector representations of each protein in the pair, and our training objective is to minimize the L1 distance between the cosine similarity of the pair's vector representations, z, and their TM-score.

Figure S2: Overview of the aligner pipeline. Proteins X and Y are fed into the pretrained LSTM protein language model [27] to obtain embeddings H_X and H_Y. These residue-level embeddings are then propagated through the match embeddings (M) and gap embeddings (G) to obtain the match scores µ and the gap scores g. The match and gap scores are used to evaluate the differentiable dynamic programming algorithm and generate a predicted alignment traceback.

Figure S4: a) SwissProt cumulative Pearson correlations between the known TM-scores and predicted TM-scores for pairs of sequences at different sequence identity thresholds. The correlation coefficient at a particular value represents the correlation for SwissProt test pairs below that sequence identity threshold. b) CATH S40 predicted TM-scores versus ground-truth TM-scores for different test datasets (Pairs, Domains, Folds). Contour plots show the density of points; we also show the trend line for the relationship and the Pearson correlation.

Figure S5: Comparison of DeepBLAST and Needleman-Wunsch on the Malisam and Malidup benchmarks. TM-score measures the superposition agreement between the two aligned protein structures. The oPSI metric measures the fraction of aligned residues, relative to the smaller protein, among the aligned residues predicted by the alignment method to superimpose strongly. The oRMS metric measures the root-mean-square deviation of the atomic positions of the aligned residues predicted by the alignment method to superimpose strongly.
The oSeq identity score measures the fraction of identical residues over the subset of the sequence alignment that was also aligned structurally by the method. All of the alignment metrics are displayed in rank order, and the points represent the manual scores for each given protein pair, representing an upper or lower bound of the correct alignment.

Figure S6: a) We evaluated how long it takes to encode query sequences using TM-Vec without parallelization on 1 GPU. The query size is on the x-axis, and the encoding time in seconds is on the y-axis. Encoding queries is linear in time. b) We evaluated how long it takes to search an indexed TM-Vec database and return the nearest neighbors once the queries have been encoded. The query size (number of sequences) is on the x-axis and the query time (seconds) is on the y-axis, for databases of different sizes ranging from 10K to 5M sequences. Search time is trivial relative to the time it takes to encode queries, and the size of the database does not materially impact speed. c) Search and alignment speed for TM-Vec compared against BLAST and Diamond for different query and database sizes. d) CPU vs. GPU differentiable Needleman-Wunsch benchmarks. The batch-size benchmark was run with randomized proteins of length 800, and the length benchmark was run with a fixed batch size of 64.

Figure S7: a) t-SNE visualization of embeddings for both bacteriocins and non-toxin proteins. There is a clear separation between bacteriocins and non-bacteriocins. b) Confusion matrix for predicting the bacteriocin status/class for a held-out test set of proteins. 85% of the data was used to train a k-nearest neighbors classifier with k = 3 neighbors. The overall precision and recall on the held-out test set were 0.98 and 0.93, respectively.

Figure S8: Distribution of TM-Vec true positive rates across held-out alignment datasets.