ABSTRACT
Background The ability to predict which genes will respond to perturbation of a TF’s activity serves as a benchmark for our systems-level understanding of transcriptional regulatory networks. In previous work, machine learning models have been trained to predict static gene expression levels in a given sample by using data from the same or similar conditions, including data on TF binding locations, histone marks, or DNA sequence. We report on a different challenge – training machine learning models that can predict which genes will respond to perturbation of a TF without using any data from the perturbed cells.
Results Existing TF location data (ChIP-Seq) from human K562 cells have no detectable utility for predicting which genes will respond to perturbation of the TF, but data obtained by newer methods in yeast cells are useful. TF-independent features of genes, including their pre-perturbation expression level and expression variation, are very useful for predicting responses to TF perturbations. This shows that some genes are poised to respond to TF perturbations and others are resistant, which helps explain why it has been so difficult to predict responses from binding locations. Certain histone marks (HMs), including H3K4me1 and H3K4me3, have some predictive power, especially when located downstream of the transcription start site. In human cells, the predictive power of HMs is much less than that of gene expression level and variation. Code is available at https://github.com/yiming-kang/TFPertRespExplainer.
Conclusions Sequence-based or epigenetic properties of genes strongly influence their tendency to respond to direct TF perturbations, partially explaining the oft-noted difficulty of predicting responsiveness from TF binding location data. These molecular features are largely reflected in and summarized by the gene’s expression level and expression variation.
Competing Interest Statement
The authors have declared no competing interest.