Abstract
Transcriptional regulation, involving the complex interplay between regulatory sequences and proteins, directs all biological processes. Computational models of transcription lack the generalizability needed to extrapolate accurately to unseen cell types and conditions. Here, we introduce GET, an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types. Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types. GET adapts readily to new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions and uncovering universal and cell-type-specific transcription factor interaction networks. We evaluated its performance on prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors. Specifically, we show that GET outperforms current models in predicting lentivirus-based massively parallel reporter assay readout with reduced input data. In fetal erythroblasts, we identify distal (>1Mbp) regulatory regions that were missed by previous models. In B cells, we identify a lymphocyte-specific transcription factor-transcription factor interaction that explains the functional significance of a germline mutation predisposing to leukemia. In sum, we provide a generalizable and accurate model for transcription together with catalogs of gene regulation and transcription factor interactions, all with cell type specificity.
Competing Interest Statement
A US provisional patent with application number 63/486,855 has been filed by Columbia University on using the method developed in the manuscript to identify gene regulatory elements and altering gene regulation and expression, on which X.F. and R.R. are inventors.
Footnotes
We explicitly summarize the biological insights derived from our method, including the identification of novel transcription factor interactions (with experimental validation of TFAP2A-ZFX and PAX5-NR2C2 interactions), elucidation of a TF germline mutation in B-cell precursor acute lymphoblastic leukemia (B-ALL), and the method's transferability for expression prediction and interpretability in glioblastoma. We provide experimental validation of the PAX5-NR2C2 interaction using BioID and Co-IP assays in a B-ALL cell line, demonstrating that the G183S mutation alters this interaction, linked to a distinct transcriptional program in 141 B-ALL patients. We show the transferability of GET to new datasets, including TF-perturbed hESC multiome data (SHARE-seq), 10x multiome glioblastoma data (NCI Human Tumor Atlas Network), and bulk chromatin profiles (OmniATAC-seq and CAGE-seq). We enable a parameter-efficient finetuning procedure for transfer learning on a desktop GPU within 30 minutes, resulting in a smaller adaptor for dataset-specific prediction. We extensively benchmark against Enformer (DeepMind) and HyenaDNA (Stanford), demonstrating GET's superior performance in distal prediction and improved performance in enhancer prioritization (K562 cell line) and gene expression prediction (mRNA-seq and CAGE) across various cell types, including outperforming Enformer on the CAGE K562 track and ATAC-seq peak prediction tasks.