Abstract
Motivation Learning robust prediction models from molecular profiles (e.g., expression data) and phenotype data (e.g., drug response) is a crucial step toward the development of precision medicine. Extracting a meaningful low-dimensional feature representation from a patient’s molecular profile is key to overcoming the high dimensionality of such data. Deep learning-based unsupervised feature learning has enormously improved image classification by making it possible to use large amounts of “unlabeled” images that are informative of the prediction task.
Approach We present the DeepProfile framework, which extracts latent variables from publicly available expression data using variational autoencoders (VAEs) and uses these latent variables as features for phenotype prediction. To our knowledge, DeepProfile is the first attempt to use deep learning to learn a feature representation from a large number of unlabeled (i.e., without phenotype) expression samples that are not incorporated into the prediction problem. We apply DeepProfile to predicting response to hundreds of cancer drugs based on gene expression data. Most patients with advanced cancer continue to receive drugs that are ineffective. This is exemplified by acute myeloid leukemia (AML), a disease for which treatments and cure rates (in the range of 25%) have remained stagnant. Effectively deploying an ever-expanding array of cancer drugs holds great promise to improve prognoses but requires methods to predict how drugs will affect specific patients.
Result We train a VAE model, which represents a mapping from input variables (here, gene expression levels) to a much smaller number of latent variables, on gene expression data from AML patients available through the Gene Expression Omnibus (GEO). Our results show that the lower-dimensional representation (i.e., latent variables) generated by the VAE significantly outperforms the original input feature representation (i.e., gene expression levels) in the drug response prediction problem.
Conclusion We demonstrate the effectiveness of VAEs in extracting a low-dimensional feature representation from publicly available unlabeled gene expression data. We show that the learned features are predictive of drug response, which suggests that the latent variables capture important processes relevant to the prediction problem.
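To make the mapping from expression levels to latent variables concrete, the following is a minimal NumPy sketch of a VAE forward pass and its training objective. It is an illustration under stated assumptions, not the DeepProfile implementation: the layer sizes, the single hidden layer, the ReLU activation, the unit-variance Gaussian reconstruction term, and all names (init_params, encode, decode, negative_elbo) are hypothetical, the weights are random and untrained, and the input is synthetic noise rather than GEO data.

```python
import numpy as np


def init_params(n_genes, n_hidden, n_latent, rng):
    """Randomly initialise a one-hidden-layer encoder and decoder (untrained)."""
    s = 0.01
    return {
        "W_h": rng.normal(0, s, (n_genes, n_hidden)), "b_h": np.zeros(n_hidden),
        "W_mu": rng.normal(0, s, (n_hidden, n_latent)), "b_mu": np.zeros(n_latent),
        "W_lv": rng.normal(0, s, (n_hidden, n_latent)), "b_lv": np.zeros(n_latent),
        "W_d": rng.normal(0, s, (n_latent, n_hidden)), "b_d": np.zeros(n_hidden),
        "W_o": rng.normal(0, s, (n_hidden, n_genes)), "b_o": np.zeros(n_genes),
    }


def encode(x, p):
    """Map expression profiles to the mean and log-variance of q(z | x)."""
    h = np.maximum(0.0, x @ p["W_h"] + p["b_h"])  # ReLU hidden layer
    mu = h @ p["W_mu"] + p["b_mu"]
    log_var = h @ p["W_lv"] + p["b_lv"]
    return mu, log_var


def decode(z, p):
    """Map latent variables back to reconstructed expression levels."""
    h = np.maximum(0.0, z @ p["W_d"] + p["b_d"])
    return h @ p["W_o"] + p["b_o"]


def negative_elbo(x, p, rng):
    """Reconstruction error plus KL(q(z | x) || N(0, I)), averaged over samples."""
    mu, log_var = encode(x, p)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps  # reparameterisation trick
    x_hat = decode(z, p)
    # Gaussian likelihood with unit variance, up to an additive constant.
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=1)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return np.mean(recon + kl)


rng = np.random.default_rng(0)
x_unlabeled = rng.standard_normal((32, 500))  # synthetic stand-in: 32 samples x 500 genes
params = init_params(n_genes=500, n_hidden=64, n_latent=8, rng=rng)

loss = negative_elbo(x_unlabeled, params, rng)  # quantity a training loop would minimise
features, _ = encode(x_unlabeled, params)       # posterior means as low-dimensional features
```

In a pipeline of the kind the abstract describes, the parameters would first be fitted by minimising this loss on the unlabeled expression samples, and the encoder output for each labeled sample would then replace the raw expression levels as input to a separate drug response predictor. Using the posterior mean rather than a sampled z as the feature vector is a design choice made here so that the features are deterministic.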