PT - JOURNAL ARTICLE
AU - Vikram Agarwal
AU - Jay Shendure
TI - Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks
AID - 10.1101/416685
DP - 2018 Jan 01
TA - bioRxiv
PG - 416685
4099 - http://biorxiv.org/content/early/2018/09/13/416685.short
4100 - http://biorxiv.org/content/early/2018/09/13/416685.full
AB - Algorithms that accurately predict gene structure from primary sequence alone were transformative for annotating the human genome. Can we also predict the expression levels of genes based solely on genome sequence? Here we sought to apply deep convolutional neural networks towards this goal. Surprisingly, a model that includes only promoter sequences and features associated with mRNA stability explains 59% and 71% of variation in steady-state mRNA levels in human and mouse, respectively. This model, which we call Xpresso, more than doubles the accuracy of alternative sequence-based models, and isolates rules as predictive as models relying on ChIP-seq data. Xpresso recapitulates genome-wide patterns of transcriptional activity and predicts the influence of enhancers, heterochromatic domains, and microRNAs. Model interpretation reveals that promoter-proximal CpG dinucleotides strongly predict transcriptional activity. Looking forward, we propose the accurate prediction of cell type-specific gene expression based solely on primary sequence as a grand challenge for the field.