Abstract
Existing large gene expression data repositories hold enormous potential to elucidate disease mechanisms, characterize changes in cellular pathways, and stratify patients based on their molecular profiles. To realize this potential, integrative resources and tools are needed that allow comparison of results across datasets and data types. We propose an intuitive approach for data-driven stratification of molecular profiles and benchmark our methodology using the dimensionality reduction algorithm t-SNE with multi-center and multi-platform data representing hematological malignancies. Our approach enables assessing the contribution of biological versus technical variation to sample clustering, directly incorporating additional datasets into the same low-dimensional representation of molecular disease subtypes, comparing sample groups between separate t-SNE representations (maps), and characterizing the obtained clusters based on pathway databases and additional multi-omics data. In the example application, our approach revealed differential activity of the SAM-dependent DNA methylation pathway in the acute myeloid leukemia patient cluster characterized by CEBPA mutations; consistent with this, the cluster was validated to have globally elevated DNA methylation levels.