bioRxiv
New Results

High-throughput deep learning variant effect prediction with Sequence UNET

Alistair S. Dunham, Pedro Beltrao, Mohammed AlQuraishi
doi: https://doi.org/10.1101/2022.05.23.493038
Alistair S. Dunham
1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
2 Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Saffron Walden CB10 1RQ
  • For correspondence: ad44@sanger.ac.uk
Pedro Beltrao
1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
3 Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich 8093, Switzerland
Mohammed AlQuraishi
4 Department of Systems Biology, Columbia University, New York, NY, USA

Abstract

Understanding the consequences of protein-coding mutations is important for many applications in biology and medicine. The vast number of possible mutations across species makes comprehensive experimental characterisation impossible, even with recent high-throughput techniques, which means computationally predicting the consequences of variation is essential for many analyses. Previous variant effect prediction (VEP) tools, generally based on evolutionary conservation and protein structure, are often computationally intensive, making them difficult to scale and limiting potential applications. Recent developments in deep learning techniques, including protein language models, together with the growing scale of biological data, have led to a new generation of predictors. These models have improved prediction performance but are still often computationally intensive to run because of slow training steps, hardware requirements and large model sizes. In this work we introduce a new highly scalable deep learning architecture, Sequence UNET, that classifies and predicts variant frequency directly from protein sequence. This model learns to build representations of protein sequence features at a range of scales using a fully convolutional U-shaped compression/expansion architecture. We show that it can generalise to pathogenicity prediction, achieving comparable performance on ClinVar to methods including EVE and ESM-1b at greatly reduced computational cost. We further demonstrate its scalability by analysing the consequences of 8.3 billion variants in 904,134 proteins detected in a large-scale proteomics analysis, showing a link between conservation and protein abundance. Sequence UNET can be run on modest hardware through an easy-to-use Python package.
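
The U-shaped compression/expansion idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (their code is linked under Footnotes) and it does not use the Sequence UNET package API. Everything here is an assumption made for illustration: the function names, a single down/up level, 8 filters, kernel width 3, untrained random weights, and an output of 20 per-position scores (one per amino acid). It only shows how a fully convolutional network can compress a sequence, expand it back, and merge the two scales through a skip connection.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def one_hot(seq):
    """Encode a sequence as an L x 20 matrix (unknown residues become all zeros)."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]

def init_layer(rng, kernel, c_in, c_out):
    """Random (untrained) weights of shape kernel x c_in x c_out, plus zero biases."""
    weights = [[[rng.uniform(-0.1, 0.1) for _ in range(c_out)]
                for _ in range(c_in)] for _ in range(kernel)]
    return weights, [0.0] * c_out

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv1d(x, layer, activation):
    """Zero-padded 1D convolution over sequence positions; output length equals input length."""
    weights, bias = layer
    half = len(weights) // 2
    out = []
    for i in range(len(x)):
        row = list(bias)
        for j, w_j in enumerate(weights):
            pos = i + j - half
            if 0 <= pos < len(x):
                for c, value in enumerate(x[pos]):
                    for o, w in enumerate(w_j[c]):
                        row[o] += value * w
        out.append([activation(v) for v in row])
    return out

def max_pool(x):
    """Compression step: halve the sequence length (rounding up) by pairwise max."""
    return [[max(col) for col in zip(*x[i:i + 2])] for i in range(0, len(x), 2)]

def upsample(x, length):
    """Expansion step: repeat each position twice, then trim to the target length."""
    return [row for row in x for _ in range(2)][:length]

def init_params(seed=0, filters=8, kernel=3):
    """Hypothetical layer sizes for a one-level U-shaped network."""
    rng = random.Random(seed)
    return {
        "down": init_layer(rng, kernel, 20, filters),
        "bottom": init_layer(rng, kernel, filters, 2 * filters),
        "out": init_layer(rng, kernel, 3 * filters, 20),
    }

def forward(seq, params):
    """Return an L x 20 matrix of scores in (0, 1) for a protein sequence."""
    x = one_hot(seq)
    down = conv1d(x, params["down"], relu)                    # L x filters
    bottom = conv1d(max_pool(down), params["bottom"], relu)   # ceil(L/2) x 2*filters
    up = upsample(bottom, len(seq))                           # L x 2*filters
    merged = [u + d for u, d in zip(up, down)]                # skip connection: L x 3*filters
    return conv1d(merged, params["out"], sigmoid)             # L x 20
```

Calling `forward("MKTAYIAKQRQ", init_params())` returns one row of 20 scores per residue. Because every layer is convolutional, the same parameters apply to sequences of any length, which is one reason architectures of this kind can be cheap to run at scale.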

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://www.ebi.ac.uk/biostudies/studies/S-BSST732

  • https://github.com/allydunham/sequence_unet

  • https://github.com/allydunham/proteinnetpy

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted May 24, 2022.

Subject Area

  • Bioinformatics