RT Journal Article
SR Electronic
T1 Sequential compression across latent space dimensions enhances gene expression signatures
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 573782
DO 10.1101/573782
A1 Gregory P. Way
A1 Michael Zietz
A1 Daniel S. Himmelstein
A1 Casey S. Greene
YR 2019
UL http://biorxiv.org/content/early/2019/03/11/573782.abstract
AB Background: Unsupervised machine learning algorithms applied to gene expression data extract latent, or hidden, signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically-appropriate latent dimensionality. Results: We compressed gene expression data from three large transcriptomic datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. Rather than selecting a single latent dimensionality, we sequentially compressed these data into many dimensions ranging from 2 to 200. We trained principal components analysis (PCA), independent components analysis (ICA), non-negative matrix factorization (NMF), denoising autoencoder (DAE), and variational autoencoder (VAE) models. We observed various tradeoffs for each model. For example, we observed high model stability between PCA, ICA, and NMF algorithms across latent dimensionalities. We identified more unique biological signatures in DAE and VAE model ensembles in intermediate latent dimensionalities. However, we captured the most pathway-associated features using all compressed features across algorithms, ensembles, and dimensions. We also used multiple latent dimensionalities to optimize gene expression signatures representing sample sex, neuroblastoma MYCN amplification, and various blood cell types, which generalized to external datasets. In supervised machine learning tasks, compressed features predicted cancer type and gene alteration status.
In this setting, the best performing supervised models used features from different dimensionalities and compression algorithms indicating that there was no single best dimensionality or compression algorithm. Conclusions: Ensembles of features from different unsupervised algorithms discover biological signatures in large transcriptomic datasets. To enhance biological signature discovery, rather than compressing input data into a single pre-selected dimensionality, it is best to perform compression on input data over many latent dimensionalities.
Abbreviations: RNAseq: RNA sequencing; PCA: principal components analysis; ICA: independent components analysis; NMF: non-negative matrix factorization; AE: autoencoder; DAE: denoising autoencoder; VAE: variational autoencoder; TCGA: the cancer genome atlas; GTEx: genome tissue expression project; TARGET: therapeutically applicable research to generate effective treatments project; BRCA: breast invasive carcinoma; COAD: colon adenocarcinoma; LGG: low grade glioma; PCPG: pheochromocytoma and paraganglioma; LAML: acute myeloid leukemia; LUAD: lung adenocarcinoma; GEO: gene expression omnibus; ROC: receiver operating characteristic; PR: precision recall; AUROC: area under the receiver operating characteristic curve; AUPR: area under the precision recall curve; CV: cross validation; ORA: overrepresentation analysis; GSEA: gene set enrichment analysis; SVD: singular value decomposition; CCA: canonical correlation analysis; SVCCA: singular vector canonical correlation analysis; TF: transcription factor; DMSO: dimethyl sulfoxide