RT Journal Article SR Electronic T1 Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping JF bioRxiv FD Cold Spring Harbor Laboratory SP 2020.01.23.915405 DO 10.1101/2020.01.23.915405 A1 Simon Höllerer A1 Laetitia Papaxanthos A1 Anja Cathrin Gumpinger A1 Katrin Fischer A1 Christian Beisel A1 Karsten Borgwardt A1 Yaakov Benenson A1 Markus Jeschek YR 2020 UL http://biorxiv.org/content/early/2020/01/24/2020.01.23.915405.abstract AB Predicting quantitative effects of gene regulatory elements (GREs) on gene expression is a longstanding challenge in biology. Machine learning models for gene expression prediction may be able to address this challenge, but they require experimental datasets that link large numbers of GREs to their quantitative effect. However, current methods to generate such datasets experimentally are either restricted to specific applications or limited by their technical complexity and error-proneness. Here we introduce DNA-based phenotypic recording as a widely applicable and practical approach to generate very large datasets linking GREs to quantitative functional readouts of high precision, temporal resolution, and dynamic range, solely relying on sequencing. This is enabled by a novel DNA architecture comprising a site-specific recombinase, a GRE that controls recombinase expression, and a DNA substrate modifiable by the recombinase. Both GRE sequence and substrate state can be determined in a single sequencing read, and the frequency of modified substrates amongst constructs harbouring the same GRE is a quantitative, internally normalized readout of this GRE’s effect on recombinase expression. Using next-generation sequencing, the quantitative expression effect of extremely large GRE sets can be assessed in parallel. As a proof of principle, we apply this approach to record translation kinetics of more than 300,000 bacterial ribosome binding sites (RBSs), collecting over 2.7 million sequence-function pairs in a single experiment. Further, we generalize from these large-scale datasets by a novel deep learning approach that combines ensembling and uncertainty modelling to predict the function of untested RBSs with high accuracy, substantially outperforming state-of-the-art methods. The combination of DNA-based phenotypic recording and deep learning represents a major advance in our ability to predict quantitative function from genetic sequence.