Abstract
Transcription factors (TFs) and microRNAs (miRNAs) are fundamental regulators of gene expression, cell state, and biological processes. This study investigated whether a small subset of TFs and miRNAs could accurately predict genome-wide gene expression. We analyzed 8895 samples across 31 cancer types from The Cancer Genome Atlas and identified 28 miRNA and 28 TF clusters using unsupervised learning. The medoids of these clusters could differentiate tissues of origin with 92.8% accuracy, demonstrating their biological relevance. We developed Tissue-Agnostic and Tissue-Aware models to predict the expression of 20,000 genes from the 56 selected medoid miRNAs and TFs. The Tissue-Aware model attained an R² of 0.70 by incorporating tissue-specific information. Despite measuring only about 1/400th of the transcriptome, the prediction accuracy was comparable to that achieved with the 1000 landmark genes. This suggests that the transcriptome has an intrinsically low-dimensional structure that can be captured by a few regulatory molecules. Our approach could enable cheaper transcriptome assays and the analysis of low-quality samples. It also provides insights into which genes are heavily regulated by miRNAs/TFs and which are governed by alternative mechanisms. However, model transportability was impacted by dataset discrepancies, especially in miRNA distribution. Overall, this study demonstrates the potential of a biology-guided approach for robust transcriptome representation.
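To make the pipeline summarized above concrete, the following is a minimal sketch on synthetic data, not the implementation used in this study. It assumes that cluster labels are already available, that distance between regulators is 1 minus Pearson correlation, and that the predictor is a closed-form ridge regression; the abstract specifies none of these choices, and the function names `cluster_medoids`, `fit_ridge`, and `r2_score` are hypothetical.

```python
import numpy as np


def cluster_medoids(features, labels):
    """Return {cluster id: column index of that cluster's medoid}.

    features: (n_samples, n_features) expression matrix (e.g. TFs or miRNAs).
    labels:   (n_features,) cluster id for each feature.
    Assumption: distance is 1 - Pearson correlation between features.
    """
    dist = 1.0 - np.corrcoef(features, rowvar=False)
    medoids = {}
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        # The medoid minimizes the summed distance to the other members.
        within = dist[np.ix_(members, members)].sum(axis=1)
        medoids[int(cluster)] = int(members[np.argmin(within)])
    return medoids


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W


def r2_score(Y, Y_hat):
    """Coefficient of determination pooled over all target genes."""
    ss_res = ((Y - Y_hat) ** 2).sum()
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for real data: 3 latent programs drive 9 regulators
# (3 clusters of 3) and 50 target genes.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
labels = np.repeat(np.arange(3), 3)
regulators = latent[:, labels] + 0.3 * rng.normal(size=(200, 9))
genes = latent @ rng.normal(size=(3, 50)) + 0.3 * rng.normal(size=(200, 50))

medoids = cluster_medoids(regulators, labels)
X = regulators[:, sorted(medoids.values())]   # keep one regulator per cluster
W, b = fit_ridge(X[:150], genes[:150])        # train on the first 150 samples
r2 = r2_score(genes[150:], X[150:] @ W + b)   # evaluate on the held-out 50
print(medoids, round(float(r2), 3))
```

In this toy setting one medoid per cluster is enough to recover most of the variance in the target genes, which mirrors the low-dimensional structure argued for above. A tissue-aware variant could, for example, append one-hot tissue indicators to `X`, though that too is an assumption rather than the authors' stated design.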
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Details of the materials and methods were added (Figure 1 and Section 2.2). A new experiment was added (Figure 5). A new coauthor (Heather Pua), who helped with the biological interpretation of the findings, was added. All experiments were rerun to investigate the effect of normalization (RPKM vs. TPM), and the resulting figures were added to the supplement (TPM experiments in Figures S1, S2, and S3).