Abstract
Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here, we leverage recent advances in deep learning for language modelling and differentiable programming to propose DEDAL, a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or three-fold over existing methods, and better discriminates remote homologs from evolutionarily unrelated sequences. This paves the way to improvements on many downstream tasks that rely on sequence alignment in structural and functional genomics.
Competing Interest Statement
All authors are employees of Google.
Footnotes
Additional results, and corrections to the way the benchmark for homology detection and alignment prediction is built.
1 As a corollary, UniProtKB sequences containing two or more Pfam-A seed domain sequences that form part of different splits may not be used to generate flanks.
2 In practice, we rely on the entries of the pfamA_reg_full_significant, uniprot_reg_full and pfamA_reg_seed tables in Pfam’s SQL database as the source of clan annotations for UniProtKB sequences.
3 For the sake of simplicity, we tacitly assume all metrics have been defined so that larger values imply better performance. In practice, we multiply the metric by minus one before normalization and aggregation whenever this is not the case.
4 Unlike for all other ablations and the original DEDAL model, we found that disabling the homology loss for the first 300,000 steps was detrimental to the performance of the ablation trained on the original Pfam domains. Thus, we omit this solely for this ablation.
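The sign convention in footnote 3 can be illustrated with a minimal sketch. Only the multiplication by minus one comes from the text. The min-max normalization, the averaging across metrics, and the names `orient`, `min_max` and `aggregate` are assumptions made for illustration and may differ from the procedure actually used.

```python
from typing import Dict, List


def orient(values: List[float], larger_is_better: bool) -> List[float]:
    """Flip the sign of a metric when smaller values mean better performance."""
    return list(values) if larger_is_better else [-v for v in values]


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]. ASSUMPTION: the normalization is not specified in the text."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def aggregate(metrics: Dict[str, List[float]],
              larger_is_better: Dict[str, bool]) -> List[float]:
    """Return one score per model. ASSUMPTION: plain mean of the normalized metrics."""
    normalized = [min_max(orient(values, larger_is_better[name]))
                  for name, values in metrics.items()]
    n_models = len(normalized[0])
    return [sum(column[i] for column in normalized) / len(normalized)
            for i in range(n_models)]


# Hypothetical values for three models: one accuracy-like and one error-like metric.
metrics = {"f1": [0.2, 0.6, 1.0], "error": [0.5, 0.3, 0.1]}
larger_is_better = {"f1": True, "error": False}
scores = aggregate(metrics, larger_is_better)
```

After the sign flip, the model with the lowest error receives the highest normalized value, so both metrics rank the three hypothetical models in the same order and the aggregated score increases from the first model to the last.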