Abstract
Motivation Learning robust prediction models based on molecular profiles (e.g., expression data) and phenotype data (e.g., drug response) is a crucial step toward the development of precision medicine. Extracting a meaningful low-dimensional feature representation from a patient’s molecular profile is key to overcoming the high-dimensionality problem. Deep learning-based unsupervised feature learning has enormously improved image classification by enabling us to use large amounts of “unlabeled” images that are informative of the prediction task.
Approach We present the DeepProfile framework, which extracts latent variables from publicly available expression data using variational autoencoders (VAEs) and uses these latent variables as features for phenotype prediction. To our knowledge, DeepProfile is the first attempt to use deep learning to learn a feature representation from a large number of unlabeled (i.e., without phenotype) expression samples that are not incorporated into the prediction problem. We apply DeepProfile to predicting response to hundreds of cancer drugs based on gene expression data. Most patients with advanced cancer continue to receive drugs that are ineffective. This is exemplified by acute myeloid leukemia (AML), a disease for which treatments and cure rates (in the range of 25%) have remained stagnant. Effectively deploying an ever-expanding array of cancer drugs holds great promise to improve prognoses but requires methods to predict how drugs will affect specific patients.
Result We train a VAE model that represents a specific mapping from input variables (here, gene expression levels) into a much smaller number of latent variables, on the basis of gene expression data from AML patients available through the Gene Expression Omnibus (GEO). Our results show that the lower-dimensional representation (i.e., latent variables) generated by the VAE significantly outperforms the original input feature representation (i.e., gene expression levels) in the drug response prediction problem.
Conclusion We demonstrate the effectiveness of VAEs in extracting a low-dimensional feature representation from publicly available unlabeled gene expression data. We show that the learned features are relevant to drug response prediction, which indicates that the latent variables capture important processes relevant to the prediction problem.
1 Introduction
The number of potential cancer drugs is rapidly increasing: more than 1,200 cancer medicines are in clinical development in the U.S. [33]. However, cure rates of acute myeloid leukemia (AML) have remained stagnant (in the range of 25%) [22]. Cancers that are pathologically similar to each other often respond differently to the same drug regimen. There is a great need to develop computational methods to match patients to drugs based on their molecular properties and to identify molecular markers for each drug that reflect the molecular basis for drug sensitivity.
Due to the importance of the problem, numerous studies have focused on cancer drug response prediction and used various machine learning (ML) algorithms on a diverse range of biological and molecular data such as gene expression, mutations, and copy number aberrations. Many public databases provide measurements of drug responses in cancer cell lines. The most prominent include the Cancer Genome Project (CGP) [9], which tested 130 drugs in 639 cell lines, and the Cancer Cell Line Encyclopedia (CCLE) [4], which tested 24 drugs in 479 cell lines. Both of these studies used elastic net to discover novel gene-drug associations. Jang et al. also showed that regression methods like elastic net and ridge regression seem to work well on the cancer drug response prediction problem [13]. Several other studies applied more complex machine learning algorithms to improve prediction accuracy. Methods like support vector machine (SVM), least squares SVM, and random forest were applied by various studies [8], [2], [32]. Ensemble methods and multitask learning were also used. Costello et al. found a Bayesian multitask multiple kernel learning (MKL) method to be the best-performing among the machine learning algorithms compared, and gene expression data to be the most useful data for prediction [7]. Yuan et al. used multitask learning across cancer drugs in order to increase both the accuracy and interpretability of the prediction problem [37]. Lee and Celik et al. developed the MERGE algorithm, which integrates multi-omic prior information to discover robust gene-drug associations [22].
Several studies used deep learning for similar purposes. Menden et al. used neural networks for cancer drug sensitivity prediction [24]. Rampasek et al. built variational autoencoder (VAE) models [20] to improve drug response prediction accuracy using pre- and post-treatment cell lines [28]. Way and Greene used VAEs to learn a biologically relevant latent space from The Cancer Genome Atlas (TCGA) pan-cancer data [35]. Our approach, DeepProfile, differs from past studies in that, to our knowledge, it is the first attempt to use deep learning to learn a feature representation from a large number of unlabeled (i.e., without phenotype) expression samples that are not incorporated into the prediction problem and to use that feature representation to solve prediction tasks. We show that DeepProfile significantly improves prediction performance on the AML drug sensitivity prediction problem and outperforms other dimensionality reduction methods.
DeepProfile has three unique aspects compared to previous studies on drug sensitivity prediction or dimensionality reduction: (1) DeepProfile extracts a lower-dimensional feature representation of a patient’s gene expression data by transferring information, captured by the VAE model, from many other patients with the same cancer type. (2) DeepProfile uses deep learning to learn non-linear mappings between genes and latent variables, which might reveal deeper structure within the data and potentially capture complex, non-linear relationships between gene expression and complex traits such as drug sensitivity. (3) DeepProfile shows significantly better prediction performance than other dimensionality reduction methods.
2 Methods
2.1 Datasets
We trained our VAE model using publicly available gene expression data from different Affymetrix microarray platforms which we downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. These data consist of 4,367 leukemia patient samples, which include 2,813 with AML and others with ALL (acute lymphoblastic leukemia), CML (chronic myelogenous leukemia), CLL (chronic lymphocytic leukemia), BPDCN (blastic plasmacytoid dendritic cell neoplasm), or MDS (myelodysplastic syndrome). The details of the datasets collected from GEO are provided in Table 1.
The data we used to test the learned VAE model were collected by the University of Washington Medical Center (UWMC) and consist of genome-wide gene expression data from 30 AML patient samples and the in vitro drug sensitivity of these samples to 160 chemotherapy drugs, as introduced by Lee and Celik et al. [22].
We chose to use publicly available data from GEO to train our VAE model because this enables us to utilize a large number of training samples to learn a low-dimensional embedding of high-dimensional gene expression data. We also believe that a VAE model learned from a large set of publicly available samples is more generalizable to broader leukemia (or AML) populations.
In order to integrate data from various platforms, we used Bioconductor annotation databases to convert the probe IDs specific to the array platforms to the human gene IDs. There are 4,051 genes that are present in all datasets listed in Table 1. We also standardized (i.e., made zero-mean and unit variance) each gene in each dataset before combining the datasets for learning the VAE model. This is done to ensure that different features (here, gene expression levels) are on the same scale. We finally applied batch effect correction on the data [16] to minimize the effect of potential confounders resulting from experimental variations.
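The gene intersection and per-dataset standardization described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `standardize_and_merge`, the dictionary-based data layout, the use of the population standard deviation, and the guard for constant genes are all assumptions, and the probe-to-gene mapping and batch effect correction steps are not shown.

```python
from statistics import mean, pstdev


def standardize_and_merge(datasets):
    """Merge expression datasets on their shared genes.

    datasets: list of dicts mapping gene ID -> list of expression values
    (one value per sample). Hypothetical layout chosen for illustration.
    Returns (shared_genes, merged), where merged maps each shared gene to
    the concatenation of its per-dataset standardized values.
    """
    # Keep only the genes that are present in every dataset.
    shared = sorted(set.intersection(*(set(d) for d in datasets)))
    merged = {gene: [] for gene in shared}
    for dataset in datasets:
        for gene in shared:
            values = dataset[gene]
            mu = mean(values)
            sd = pstdev(values)  # population SD; the paper does not say which
            # Assumption: leave constant genes centered but unscaled.
            scale = sd if sd > 0 else 1.0
            merged[gene].extend((v - mu) / scale for v in values)
    return shared, merged


# Two toy "platforms" that measure gene A on very different scales.
toy = [
    {"A": [1.0, 2.0, 3.0], "B": [0.0, 0.0, 0.0]},
    {"A": [10.0, 20.0, 30.0], "C": [1.0, 2.0, 3.0]},
]
genes, merged = standardize_and_merge(toy)
print(genes, [round(v, 3) for v in merged["A"]])
```

Standardizing within each dataset before concatenation puts a gene measured on different platforms on a common scale, which is the stated purpose of this step.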
2.2 The DeepProfile framework
We adopt a deep learning approach to learn a low-dimensional feature representation (or ‘embedding’) for the gene expression data. A variational autoencoder (VAE) is an extension of a classical autoencoder and uses variational inference to infer the posterior of latent embeddings given input data (i.e., p(z|x), where z refers to latent embeddings and x refers to input variables). Like a classical autoencoder, the VAE learns latent embeddings with the objective of minimizing reconstruction error. However, unlike a classical autoencoder, the VAE assumes that the posterior is Gaussian distributed with a standard Gaussian prior (i.e., N(0, 1)). This formulation enables us to learn network parameters using scalable optimization methods (such as adaptive moment estimation (Adam)) and reparameterization tricks. The learned decoder (i.e., p(x|z)) can then be used as a generative model to generate new samples from underlying latent embedding space. The standard normal prior forces the encoding and decoding networks to produce a generalizable, smooth latent space by learning meaningful features and embedding similar samples close together. We use a VAE model to learn meaningful latent features from the gene expression data of leukemia patients collected from publicly available datasets and use the learned latent features to predict the drug response of AML patients to various anti-cancer drugs. The DeepProfile framework is visualized in Figure 1.
Our VAE model consists of encoder and decoder networks, each with 4 dense layers. The encoder networks for the means and standard deviations share the first three dense layers, which have 1,024, 256, and 64 hidden units, respectively. All of these layers use batch normalization and rectified linear unit (ReLU) activation. The fourth dense layers have 8 hidden units (the number of latent variables) and are trained separately for the means and standard deviations. Similarly, the decoder has 3 dense layers with 64, 256, and 1,024 hidden units and ReLU activation. The final layer has 4,051 units (the original data dimension) with identity activation. As the objective function, we use the sum of the reconstruction error (i.e., mean squared error) and the Kullback-Leibler (KL) divergence between the posterior and the prior. The network is trained with the Adam optimizer using a learning rate of 0.0005 [19]. Furthermore, we applied a warm-up process that gradually introduces the KL divergence term into the objective [18], starting with a scaling factor of 0 (corresponding to a standard autoencoder) and slowly increasing it to 1 (corresponding to a standard VAE). The model is built using Keras.
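To make the objective concrete, the sketch below computes the per-sample loss as reconstruction error plus a scaled KL term, together with the reparameterized sampling step. It is written in plain Python rather than Keras and is not the authors' code. Several details are assumptions the text does not specify: the encoder is taken to output log-variances rather than standard deviations, the warm-up schedule is linear over a hypothetical `warmup_epochs`, and the MSE is averaged over genes while the KL term is summed over latent dimensions.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dim."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def mse(x, x_hat):
    """Mean squared reconstruction error over input dimensions."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_weight(epoch, warmup_epochs):
    """Assumed linear warm-up: 0 at epoch 0, reaching 1 at warmup_epochs."""
    return min(1.0, epoch / warmup_epochs)


def vae_loss(x, x_hat, mu, log_var, epoch, warmup_epochs=50):
    """Reconstruction error plus the warm-up-scaled KL term."""
    beta = kl_weight(epoch, warmup_epochs)
    return mse(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.5] * 8, [0.0] * 8   # 8 latent variables, as in the text
z = reparameterize(mu, log_var, rng)
x, x_hat = [1.0, 0.0, -1.0], [0.8, 0.1, -0.7]
print(len(z), vae_loss(x, x_hat, mu, log_var, epoch=0),
      vae_loss(x, x_hat, mu, log_var, epoch=50))
```

With a scaling factor of 0 the loss reduces to the reconstruction error alone, which matches the standard autoencoder case described in the text; at a factor of 1 the full VAE objective is recovered.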
2.3 Training and testing of the DeepProfile framework
After learning the VAE model, we used the inferred weights to encode an 8-dimensional feature vector for each of the 30 AML patients for whom we have drug response data. We then used the encoded low-dimensional representation (LDR) as input to an L1-regularized linear regression (for drug response prediction) or an L1-regularized logistic regression (for complete remission class prediction) and measured the prediction performance. We carried out the drug response prediction task separately for each drug. We used leave-one-out cross-validation (LOOCV) to compute prediction error and used 5-fold cross-validation (CV) on the training samples to select the regularization parameter λ.
Since the VAE objective is non-convex, the learned LDR is not unique. To ensure that our results take into account the potential variation in prediction performance due to the variability of the learned LDR, we trained the VAE model 10 times and repeated the prediction tasks explained above for each of the 10 learned 8-dimensional LDRs. The error bars in our results represent one standard deviation across the 10 VAE runs (Figures 2 and 3).
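The nested evaluation scheme described above (an outer LOOCV loop for the prediction error and an inner 5-fold CV on the training samples to choose λ) can be sketched in plain Python. This is an illustrative stand-in, not the authors' implementation: the coordinate-descent Lasso solver, the unshuffled interleaved folds, the candidate λ grid, and the toy data are all assumptions made for the example.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the Lasso proximal step)."""
    return v - t if v > t else v + t if v < -t else 0.0


def fit_lasso(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1."""
    n, p = len(X), len(X[0])
    w, b0 = [0.0] * p, sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = zj = 0.0
            for xi, yi in zip(X, y):
                # Prediction using every feature except j.
                partial = b0 + sum(w[k] * xi[k] for k in range(p) if k != j)
                rho += xi[j] * (yi - partial)
                zj += xi[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (zj / n) if zj > 0 else 0.0
        # The intercept is not penalized.
        b0 = sum(yi - sum(wk * xk for wk, xk in zip(w, xi))
                 for xi, yi in zip(X, y)) / n
    return b0, w


def predict(model, x):
    b0, w = model
    return b0 + sum(wk * xk for wk, xk in zip(w, x))


def kfold_indices(n, k):
    """Interleaved, unshuffled folds (an assumption, chosen for determinism)."""
    return [list(range(i, n, k)) for i in range(k)]


def cv_mse(X, y, lam, k=5):
    """Inner k-fold CV error for one candidate lambda."""
    errors = []
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
        errors.extend((y[i] - predict(model, X[i])) ** 2 for i in fold)
    return sum(errors) / len(errors)


def loocv_mse(X, y, lambdas, k=5):
    """Outer LOOCV; lambda is re-selected on each training set."""
    errors = []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        best = min(lambdas, key=lambda lam: cv_mse(X_train, y_train, lam, k))
        model = fit_lasso(X_train, y_train, best)
        errors.append((y[i] - predict(model, X[i])) ** 2)
    return sum(errors) / len(errors)


# Toy data: 12 samples, 2 features, response depends on the first feature only.
X_demo = [[i / 5.5 - 1.0, ((7 * i) % 12) / 5.5 - 1.0] for i in range(12)]
y_demo = [2.0 * x[0] + 1.0 for x in X_demo]
demo_mse = loocv_mse(X_demo, y_demo, lambdas=[0.001, 0.1, 1.0])
print(demo_mse)
```

Selecting λ inside each outer split keeps the held-out sample out of the tuning step, so the LOOCV error is not biased by the choice of regularization strength.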
3 Results
We compared the learned VAE embeddings to the 16,864 gene expression levels measured in 30 AML patients (Figure 2), as well as to LDR inferred by other dimensionality reduction methods including k-means clustering and Principal Component Analysis (PCA) (Figure 3a). We evaluated our methods by predicting (i) drug response and (ii) complete remission.
3.1 Drug response prediction results
We used the same Lasso regression procedure (Section 2.3) for each method in the comparison and measured the LOOCV mean squared error (MSE) for each of the 160 anti-cancer drugs. We trained our VAE model in two settings that differ in the training samples: (I) all 4,367 leukemia samples, which include AML and other leukemia types, and (II) only the 2,813 AML samples. We call the VAE models in Setting I and Setting II “VAE-Leukemia” and “VAE-AML”, respectively. We used these two settings to examine how the diversity of the VAE training data affects the AML drug response prediction performance of the learned latent representation. Both settings use the 4,051 genes shared by all leukemia datasets (Table 1).
Figure 2a compares the average MSE over all drugs when we use the expression levels from 16,864 genes, VAE-Leukemia LDR, and VAE-AML LDR. We observed that both VAE-Leukemia LDR and VAE-AML LDR outperformed the gene expression levels in predicting drug response. The VAE-AML LDR led to a lower MSE than the VAE-Leukemia LDR and reduced the MSE by 9.9% compared to the gene expression levels. We believe that VAE-AML LDR yields a lower error than VAE-Leukemia LDR because the VAE can learn more AML-specific features, which are more useful for the AML drug response prediction problem. Thus, even though excluding other leukemia patients reduces the number of training samples available to VAE-AML, its error is still lower than that of VAE-Leukemia LDR.
Figure 2b shows the average MSE values for the 44 drugs whose response is predicted well (i.e., MSE ≤ 0.7 achieved by at least one of the gene expression levels, VAE-Leukemia LDR, or VAE-AML LDR). For these well-predicted drugs, both VAE-Leukemia LDR and VAE-AML LDR led to a lower average MSE than the gene expression levels, and VAE-AML LDR reduced the average MSE by 15.2% compared to the gene expression levels.
Figure 2c compares the MSE values obtained by the gene expression levels and VAE-AML LDR for each of the 160 cancer drugs. For 68.1% of the drugs (109 out of 160), VAE-AML LDR outperforms the gene expression levels. When the comparison is restricted to the 44 well-predicted drugs (i.e., MSE ≤ 0.7 achieved by at least one of gene expression and VAE-AML), the VAE-AML LDR obtains a lower error than the gene expression levels for 65.9% of the drugs (29 out of 44). These results demonstrate that the DeepProfile model is successful at drug response prediction and, in particular, that the VAE-AML LDR can reduce the prediction error significantly compared to the gene expression levels.
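The summary statistics reported in this section (relative MSE reduction, the well-predicted filter at MSE ≤ 0.7, and the percentage of drugs on which one representation beats another) are simple to compute from per-drug MSE values. The sketch below uses made-up toy MSEs and hypothetical helper names; only the final two lines check the percentages quoted in the text from the reported counts. Treating ties as non-wins is an assumption.

```python
def relative_reduction(baseline_mse, new_mse):
    """Percent reduction of new_mse relative to baseline_mse."""
    return 100.0 * (baseline_mse - new_mse) / baseline_mse


def well_predicted(mse_by_method, threshold=0.7):
    """Indices of drugs where at least one method reaches MSE <= threshold.

    mse_by_method: dict mapping method name -> list of per-drug MSEs,
    all in the same drug order.
    """
    n_drugs = len(next(iter(mse_by_method.values())))
    return [d for d in range(n_drugs)
            if min(m[d] for m in mse_by_method.values()) <= threshold]


def win_rate(mse_a, mse_b, drugs=None):
    """Percent of drugs on which method a has strictly lower MSE than b."""
    drugs = range(len(mse_a)) if drugs is None else drugs
    drugs = list(drugs)
    wins = sum(1 for d in drugs if mse_a[d] < mse_b[d])
    return 100.0 * wins / len(drugs)


# Toy per-drug MSEs (illustrative values, not results from the paper).
toy = {"expression": [1.0, 0.6, 0.9], "vae_aml": [0.8, 0.65, 0.95]}
subset = well_predicted(toy)
print(subset, win_rate(toy["vae_aml"], toy["expression"]))

# Percentages implied by the counts reported in the text.
print(round(100 * 109 / 160, 1))  # 68.1
print(round(100 * 29 / 44, 1))    # 65.9
```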
3.2 Additional drug response prediction results
We further investigated the drug response prediction performance of our DeepProfile framework by comparing it to two other dimensionality reduction algorithms: k-means clustering and PCA. For k-means clustering, we learned 8 gene clusters and used the cluster centroids as LDR, while for PCA, we used the top 8 principal components as LDR. In this section, we also analyzed how the results are affected by the depth of the VAE model and the training data size.
Figure 3a compares the performance of the VAE-AML LDR with PCA and k-means clustering. VAE-AML LDR can outperform both PCA and k-means algorithms for the same training data and the same size of latent dimensions. This is potentially because non-linear dimensionality reduction of VAE produces more informative LDR relative to the linear methods.
Figure 3b illustrates the effect of using deeper VAE-AML models for the drug response prediction problem. Adding more layers to VAE models led to a higher performance, which is not surprising because deeper networks are able to discover complex non-linear associations among genes. Yet, when the networks are too deep, the learned VAE-AML LDR performs worse due to insufficient sample size.
Figure 3c demonstrates that the performance of VAE LDR increases with larger sample size, as expected. This indicates that our framework can further reduce the error with more samples.
3.3 Complete remission prediction results
To demonstrate that the LDR learned by the VAE can effectively predict other phenotypes, we trained an L1-regularized logistic regression on the learned VAE LDR to predict the complete remission (CR) phenotype of the 30 AML patients. Complete remission means that all signs of the cancer have disappeared in response to therapy. We note that the patients were treated in the clinic with a few common AML drugs, whereas the drug response data used in Sections 3.1 and 3.2 come from in vitro testing of the patients’ tumor samples against 160 chemotherapy drugs.
Figure 4 compares VAE LDR to the gene expression levels and two other dimensionality reduction algorithms — PCA and k-means clustering — for predicting CR. The larger area under the ROC curve for VAE LDR shows that it outperforms the two other LDRs and the gene expression levels for CR prediction. This result demonstrates that the LDR learned by VAE can generalize to other prediction tasks.
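For reference, the area under the ROC curve used in this comparison can be computed directly from predicted scores and binary labels. The sketch below uses the pairwise (Mann-Whitney) formulation, in which the AUC is the fraction of positive-negative pairs ranked correctly; the function name, the toy labels, and the toy scores are illustrative assumptions and not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = complete remission here).
    scores: predicted scores, higher meaning more likely positive.
    Tied positive-negative pairs count as half a correct ranking.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in positives for n in negatives)
    return correct / (len(positives) * len(negatives))


# Toy example: one of the four positive-negative pairs is ranked incorrectly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a larger area for one representation means it ranks patients in remission above the others more often.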
4 Conclusion
In this paper, we present the DeepProfile framework, which adopts the variational autoencoder (VAE) to learn a low-dimensional representation (LDR) from publicly available unlabeled (i.e., without phenotype data) gene expression datasets and uses the extracted LDR to predict sensitivity to anti-cancer drugs and complete remission for AML patients. We observed that the LDR generated by the VAE predicted drug response and complete remission better than the original gene expression data and two other commonly used LDR learning methods. When we used samples from only AML patients, DeepProfile reduced the average error obtained with the gene expression levels by 9.9% across all 160 drugs and by 15.2% for the 44 well-predicted drugs. Although the samples used in VAE training were obtained from many different studies carried out in different countries and labs on different microarray platforms, the VAE is quite successful at disentangling the discrepancies in the data and creating an LDR that can be used for different cancer phenotype prediction purposes.
It is interesting to note that the performance of the VAE depends not only on the sample size but also strongly on the nature of the data. We observed that when we added samples from patients with other types of leukemia, the prediction performance deteriorated. We hypothesize that, since different cancer subtypes have different characteristics and each subtype shows specific molecular properties, adding more data from different leukemia types may not help extract features important for AML.
Our future directions include: (1) improving our learning algorithm using semi-supervised VAE which benefits from labels of data while training the network, (2) incorporating RNA-seq data along with microarray data to increase the sample size for training VAE to allow it to discover further hidden characteristics from the data, and (3) extending the framework to different cancer types and building a generic tool that is useful for extracting latent features specific to different cancer types.
Footnotes
Email: suinlee{at}cs.washington.edu