Evolutionarily informed deep learning methods: Predicting transcript abundance from DNA sequence

Jacob D. Washburn; Maria Katherine Mejia-Guerra; Guillaume Ramstein; Karl A. Kremling; Ravi Valluru; Edward S. Buckler; Hai Wang

doi:10.1101/372367

ABSTRACT

Deep learning methodologies have revolutionized prediction in many fields, and show potential to do the same in molecular biology and genetics. However, applying these methods in their current forms ignores evolutionary dependencies within biological systems and can result in false positives and spurious conclusions. We developed two novel approaches that account for evolutionary relatedness in machine learning models: 1) gene-family guided splitting, and 2) ortholog contrasts. The first approach accounts for evolution by constraining the model's training and testing sets to include different gene families. The second uses evolutionarily informed comparisons between orthologous genes to both control for and leverage evolutionary divergence during the training process. The two approaches were explored and validated within the context of mRNA expression level prediction, and yielded prediction auROC values ranging from 0.72 to 0.94. Inspection of the model weights showed biologically interpretable patterns, leading to the novel hypothesis that the 3’ UTR is more important for fine-tuning mRNA abundance levels, while the 5’ UTR is more important for large-scale changes.
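
The idea behind gene-family guided splitting can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `family_guided_split`, the gene and family identifiers, and the `test_fraction` parameter are all hypothetical, and the sketch assumes that a gene-to-family mapping is already available. It shows only the constraint named in the abstract, namely that whole families, not individual genes, are assigned to either the training or the testing set.

```python
import random
from collections import defaultdict


def family_guided_split(gene_to_family, test_fraction=0.2, seed=0):
    """Split genes into train/test sets so that no gene family spans both.

    gene_to_family: dict mapping gene ID -> gene family ID (assumed given).
    test_fraction: approximate fraction of *families* placed in the test set.
    """
    # Group genes by family so that each family moves as a single unit.
    families = defaultdict(list)
    for gene, family in gene_to_family.items():
        families[family].append(gene)

    if len(families) < 2:
        raise ValueError("need at least two gene families to split")

    # Shuffle family IDs (not genes) with a fixed seed for reproducibility.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    # Keep at least one family on each side of the split.
    n_test = round(len(family_ids) * test_fraction)
    n_test = min(max(n_test, 1), len(family_ids) - 1)
    test_families = set(family_ids[:n_test])

    train, test = [], []
    for family in family_ids:
        target = test if family in test_families else train
        target.extend(families[family])
    return sorted(train), sorted(test)


# Toy data: made-up gene and family IDs, for illustration only.
gene_to_family = {
    "geneA1": "fam1", "geneA2": "fam1",
    "geneB1": "fam2",
    "geneC1": "fam3", "geneC2": "fam3", "geneC3": "fam3",
    "geneD1": "fam4",
    "geneE1": "fam5",
}

train_genes, test_genes = family_guided_split(
    gene_to_family, test_fraction=0.4, seed=1
)
train_families = {gene_to_family[g] for g in train_genes}
test_families = {gene_to_family[g] for g in test_genes}
```

Because the split is made at the family level, related genes (for example, paralogs with similar sequence) cannot leak between the training and testing sets. A random gene-level split gives no such guarantee. Libraries such as scikit-learn provide group-aware splitters (for example, `GroupShuffleSplit`) that serve a similar purpose.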

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.