bioRxiv
New Results

Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

Rajiv Movva, Peyton Greenside, Georgi K. Marinov, Surag Nair, Avanti Shrikumar, Anshul Kundaje
doi: https://doi.org/10.1101/393926
Rajiv Movva
1The Harker School, San Jose, CA, USA
2Department of Genetics, Stanford University, Stanford, CA, USA
  • For correspondence: rmovva@mit.edu akundaje@stanford.edu
Peyton Greenside
3Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA
Georgi K. Marinov
2Department of Genetics, Stanford University, Stanford, CA, USA
Surag Nair
4Department of Computer Science, Stanford University, Stanford, CA, USA
Avanti Shrikumar
4Department of Computer Science, Stanford University, Stanford, CA, USA
Anshul Kundaje
2Department of Genetics, Stanford University, Stanford, CA, USA
4Department of Computer Science, Stanford University, Stanford, CA, USA
  • For correspondence: rmovva@mit.edu akundaje@stanford.edu

Abstract

The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within the range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
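
The abstract mentions two quantities: a Spearman correlation between predicted and measured activity, and the allelic effect of a variant. The sketch below shows how each could be computed in principle. It is not the MPRA-DragoNN implementation. The function names (`average_ranks`, `spearman_rho`, `allelic_effect`) are hypothetical, and `toy_gc_model` is a deliberately trivial stand-in for a trained sequence-to-activity model. Scoring a variant as the alternate-allele prediction minus the reference-allele prediction is a common convention, assumed here rather than taken from the paper.

```python
from typing import Callable, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window across a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def allelic_effect(predict: Callable[[str], float], ref_seq: str,
                   position: int, alt_base: str) -> float:
    """Predicted activity of the alternate allele minus that of the reference."""
    if ref_seq[position] == alt_base:
        raise ValueError("alternate base matches the reference base")
    alt_seq = ref_seq[:position] + alt_base + ref_seq[position + 1:]
    return predict(alt_seq) - predict(ref_seq)


def toy_gc_model(seq: str) -> float:
    """Hypothetical placeholder model: scores a sequence by its GC fraction."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    measured = [0.1, 0.4, 0.35, 0.8]
    predicted = [0.2, 0.3, 0.5, 0.9]
    print("Spearman rho:", spearman_rho(predicted, measured))
    print("Allelic effect:", allelic_effect(toy_gc_model, "ACGTACGT", 0, "G"))
```

In practice, `predict` would wrap a trained model's forward pass on a one-hot encoded sequence, and a library routine such as `scipy.stats.spearmanr` could replace the hand-written rank correlation.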

Footnotes

  • The model was renamed from 'SNPpet' to 'MPRA-DragoNN'. Some sentences and statements in the text were revised for increased clarity and accuracy. The 'Availability' section was updated to point to a new GitHub repository with more usable software. Authors were added.

  • https://github.com/kundajelab/MPRA-DragoNN/

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted June 07, 2019.
bioRxiv 393926; doi: https://doi.org/10.1101/393926
Subject Area

  • Genomics