TY - JOUR
T1 - Multivariate information maximization yields hierarchies of expression components in tumors that are both biologically meaningful and prognostic
JF - bioRxiv
DO - 10.1101/043257
SP - 043257
AU - Shirley Pepke
AU - Greg Ver Steeg
Y1 - 2016/01/01
UR - http://biorxiv.org/content/early/2016/03/11/043257.abstract
N2 - De novo inference of clinically relevant gene function relationships from tumor RNA-seq remains a challenging task. In this work we show that Correlation Explanation (CorEx), a recently developed machine learning algorithm that optimizes over multivariate mutual information, achieves significant progress toward this goal. CorEx utilizes high dimensional correlations for a principled construction of relatively independent latent factors that “explain” dependencies in gene expression among samples. Using only ovarian tumor RNA-seq, CorEx infers gene cohorts with related function, recapitulating Gene Ontology annotation relationships. CorEx is able to identify latent factors that capture dependencies in groups of genes whose expression patterns correlate with patient survival in ovarian cancer. Some inferred pathways such as chemokine signaling and FGF signaling have been implicated previously in chemo responsiveness, but novel survival-associated groups are identified as well. These include a pathway connected with the epithelial-mesenchymal transition in breast cancer that is regulated by a potentially druggable microRNA. Further, it is seen that combinations of factors lead to a synergistic survival advantage in some cases. Comparison to normal ovarian tissue exhibits substantial differences between cancerous and non-cancerous samples related to a variety of cellular processes. In contrast to studies that attempt to partition patients into a small number of subtypes (typically 4 or fewer), our approach utilizes subgroup information for combinatoric transcriptional phenotyping.
Considering only the 66 gene expression groups that are found to have significant Gene Ontology enrichment and are also small enough to indicate specific drug targets implies a computational phenotype for ovarian cancer that allows for 3^66 possible patient profiles, enabling truly personalized treatment. The findings here demonstrate a new technique that sheds light on the complexity of gene expression dependencies in tumors and could eventually enable the use of patient RNA-seq profiles for selection of personalized and effective cancer treatments. Author Summary: Quantifying gene expression dependencies from high throughput data is a challenge given the complexity of interactions and the noisiness of even the best available data. We introduce a new method, Correlation Explanation, that uses information theoretic principles to understand gene expression data. The algorithm captures relationships among many genes by assuming they are related via connection to a common hidden explanatory factor. We apply correlation explanation to large scale expression data from 420 ovarian tumors to construct multiple layers of explanatory factors. The 5000+ genes analyzed in this study can be combined in an astronomical number of ways in principle. By searching this large space in an efficient way, the correlation explanation algorithm is able to gradually focus in on gene cohorts with strongly interacting expression patterns that are relatively independent between groups. We find that many of the groups of genes correspond to specific cellular biological processes on multiple levels, some of which can be implicated in tumor progression, metastasis, and patient survival. Putting these factors together allows each patient to have a unique summary profile of gene expression patterns that could someday be used to guide treatment decisions.
ER -