Abstract
Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. The “cis-regulatory code” (the rules that cells use to determine when, where, and how much each gene is expressed) has proven exceedingly complex, but recent advances in machine learning and in the scale and resolution of functional genomics assays have enabled significant progress towards deciphering it. However, we will likely never solve the cis-regulatory code if we restrict ourselves to models trained only on genomic sequences: homologous regions shared between training and test data can easily lead to overestimation of predictive performance, and our genomes contain too little sequence diversity to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences let us test, in each experiment, a far larger sequence space than exists in our genomes, and designed DNA sequences enable targeted queries of sequence space that maximally improve the models. Because cells use the same biochemical principles to interpret DNA regardless of its source, models trained on these synthetic data can predict genomic activity, often better than genome-trained models. Here, we provide an outlook on the field and propose a roadmap towards solving the cis-regulatory code by training models exclusively on non-genomic DNA sequences and using genomic sequences solely to evaluate the resulting models.
Competing Interest Statement
The authors have declared no competing interest.