New Results

Deep embedding and alignment of protein sequences

Felipe Llinares-López, Quentin Berthet, Mathieu Blondel, Olivier Teboul, Jean-Philippe Vert
doi: https://doi.org/10.1101/2021.11.15.468653
Felipe Llinares-López
Google Research, Brain team, Paris, France
Quentin Berthet
Google Research, Brain team, Paris, France
Mathieu Blondel
Google Research, Brain team, Paris, France
Olivier Teboul
Google Research, Brain team, Paris, France
Jean-Philippe Vert
Google Research, Brain team, Paris, France
  • For correspondence: jpvert@google.com

Abstract

Protein sequence alignment is a key component of most bioinformatics pipelines for studying the structure and function of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here, we leverage recent advances in deep learning for language modelling and differentiable programming to propose DEDAL, a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or three-fold over existing methods and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks that rely on sequence alignment in structural and functional genomics.

Competing Interest Statement

All authors are employees of Google.

Footnotes

  • This version adds results and corrects the way the benchmark for homology detection and alignment prediction is built.

  • 1. As a corollary, UniProtKB sequences containing two or more Pfam-A seed domain sequences that form part of different splits may not be used to generate flanks.

  • 2. In practice, we rely on the entries of the pfamA_reg_full_significant, uniprot_reg_full and pfamA_reg_seed tables in Pfam’s SQL database as the source of clan annotations for UniProtKB sequences.

  • 3. For the sake of simplicity, we tacitly assume all metrics have been defined so that larger values imply better performance. In practice, we multiply the metric by minus one before normalization and aggregation whenever this is not the case.

  • 4. Unlike for all other ablations and the original DEDAL model, we found that disabling the homology loss for the first 300,000 steps was detrimental for the performance of the ablation trained on the original Pfam domains. Thus, we omit this solely for this ablation.
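
The sign convention in footnote 3 can be illustrated with a minimal sketch. The footnote only states that lower-is-better metrics are multiplied by minus one before normalization and aggregation; the min-max normalization, the plain average, and all names and values below are assumptions made for illustration, not the authors' procedure.

```python
from statistics import mean


def orient(values, higher_is_better):
    """Negate a metric when smaller values are better (footnote 3)."""
    return list(values) if higher_is_better else [-v for v in values]


def min_max(values):
    """Rescale to [0, 1]. ASSUMPTION: the source does not name the normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant metric carries no ranking information; 0.5 is an arbitrary choice.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate(metrics, higher_is_better):
    """metrics maps a metric name to one value per model; returns one score per model.

    ASSUMPTION: aggregation is an unweighted mean over normalized metrics.
    """
    normalized = [
        min_max(orient(values, higher_is_better[name]))
        for name, values in metrics.items()
    ]
    return [mean(per_model) for per_model in zip(*normalized)]


# Hypothetical values for three models: one higher-is-better metric, one lower-is-better.
metrics = {"f1": [0.25, 0.75, 0.5], "error": [3.0, 1.0, 2.0]}
higher_is_better = {"f1": True, "error": False}
scores = aggregate(metrics, higher_is_better)
```

Flipping the sign first means the later steps never need to know a metric's direction: after `orient`, the best model always has the largest value on every metric.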

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted July 01, 2022.
Deep embedding and alignment of protein sequences
Felipe Llinares-López, Quentin Berthet, Mathieu Blondel, Olivier Teboul, Jean-Philippe Vert
bioRxiv 2021.11.15.468653; doi: https://doi.org/10.1101/2021.11.15.468653
Subject Area

  • Bioinformatics