A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data

Interpreting transcriptome data is an important yet challenging aspect of bioinformatic analysis. While gene set enrichment analysis is a standard tool for interpreting regulatory changes, we utilize deep learning techniques, specifically autoencoder architectures, to learn latent variables that drive transcriptome signals. We investigate whether simple autoencoders, variational autoencoders (VAE), and beta-weighted VAEs are capable of learning reduced representations of transcriptomes that retain critical biological information. We propose a novel VAE that utilizes priors from biological data to direct the network to learn a representation of the transcriptome based on understandable biological concepts. After training five different autoencoder architectures on 22,310 transcriptomes, we benchmarked their performance on organ and disease classification tasks using a separate selection of 5,577 test samples. Every tested architecture succeeded in reducing the transcriptomes to 50 latent dimensions, which captured enough variation for accurate reconstruction. The simple, fully connected autoencoder performs best across the benchmarks but lacks directly interpretable latent dimensions. The beta-weighted, prior-informed VAE implementation solves the benchmarking tasks and provides semantically meaningful latent features that correspond to biological pathways. This study opens a new direction for differential pathway analysis in transcriptomics with increased transparency and interpretability.

Author summary

The ability to measure the human transcriptome has been a critical tool for studying health and disease. However, transcriptome data sets are too large and complex for direct human interpretation. Deep learning techniques such as autoencoders are capable of distilling high-level features from complex data. However, even if deep learning models find patterns, these patterns are not necessarily represented in a way that humans can easily understand. By bringing in prior knowledge of biological pathways, we have trained the model to "speak the language" of the biologist and represent complex transcriptomes in simpler concepts that are already familiar to biologists. We can then apply the tool, for example, to compare samples from lung cancer cells to healthy cells and show which biological processes are perturbed.

Transcriptomics is a powerful tool for characterizing cellular activity under various conditions, allowing researchers to discover underlying associations between transcripts or genes and pathological or environmental factors. Transcriptomics data are therefore widely applicable across biomedical research, from understanding disease mechanisms [1] and detecting biomarkers [2,3] to identifying tissue-specific regulatory genes [4]. The broad application of the technology has generated a considerable amount of transcriptomic sequencing data and led to dedicated public platforms hosting the relevant biological data sets, such as ArrayExpress [5] and NCBI GEO [6]. In all these biomedical tasks, human-understandable interpretation of the experimental transcriptomics data is the pivotal component for understanding the underlying biology. However, this interpretation remains a challenge in the face of large and complex data sets. Here we explore the potential for machine learning models to learn a simplified representation of the transcriptome in terms of commonly understood biological processes, and thus increase its interpretability.

While gene set enrichment analysis (GSEA) [7] is a standard tool for interpreting regulatory changes to the transcriptome, the method relies heavily on a list of well-detected differentially expressed genes (DEGs) between the conditions of interest. Furthermore, most state-of-the-art models for differential expression analysis (DEA) assume a linear relationship across samples; representatives include limma [8,9], DESeq2 [10], and Seurat [11]. However, variation in measured expression levels, whether of biological or technical origin, may not always behave linearly. Given the potential for synergistic effects between genes, this assumption of linearity might lead to a loss of power. Complex sources of variation are also present in combined public data sets, which consist of a large number of samples from multiple sources and are influenced by non-biological factors such as batches or processing centers. These limitations hinder the exploration of large-scale transcriptomics data for answering fundamental biological questions.

The goal of this study is to utilize deep learning techniques to bring transparency into which biological process patterns are represented in a transcriptome data set, and thus to facilitate the interpretation of experimental results. Deep learning with artificial neural networks has developed rapidly in recent years and has outperformed traditional approaches in handling massive and complex data across multiple areas, owing to its non-linear, high-level feature extraction and objective data processing [12-16]. This development also opens new possibilities for understanding transcriptomics data and has successfully contributed, for example, to drug repurposing and development [17,18], phenotype classification [19], and genomic functional characterization [20], with a more in-depth understanding of transcriptomics data.

We are specifically interested in autoencoders as a class of methods that can reduce the dimensionality of complex data and learn its major features [21]. Autoencoders model …
However, these studies take different approaches to the challenge of associating latent representations with human-understandable biological concepts. [22] implemented an autoencoder and interpreted the latent space by correlating latent features with phenotypes post hoc. [20], on the other hand, sought to constrain the latent space to represent known pathways by restricting the network connectivity, i.e., each latent node represents a pathway, and only genes known to be involved in that pathway are connected to the node upstream. In the study of [22], the network is free to learn any representation from the data, but the burden of interpretation is left to post hoc analysis. In the case of [20], the network is restricted directly in its architecture based on gene set definitions.

Here, we see an opportunity to implement a solution that finds a middle path, in which the network is encouraged to learn a latent representation based on known biological concepts but still has the freedom to learn relationships among genes from the data. Specifically, we propose an autoencoder variant using a novel technique of introducing pathway-informed priors. The basis for this approach is the Bayesian framework implemented in variational autoencoders (VAE) [27]. The VAE framework inherently provides the opportunity to involve priors in the training process to learn latent representations. By introducing biologically meaningful priors, our approach aims to integrate prior knowledge from the domain (here, we use the Hallmark pathways defined by MSigDB as an example) while retaining the flexibility of the data-driven deep learning approach. This approach is most comparable to those presented by Zhao [25] and Lotfollahi [26], with the commonality that the goal is to produce latent features that correspond to known biological concepts. However, in contrast to our approach of incorporating canonical pathway knowledge as priors, both Zhao [25] and Lotfollahi [26] provide pathway definitions to constrain the decoder architecture. The prior-based approach provides an additional opportunity to calibrate the strength of the effect of the prior, following the established beta-VAE approach [28].

Specifically, we make use of a hyperparameter, beta, in a way similar to that described by [28], which adds weight to the influence of the priors on the training solution. Thus, using beta, we can control the extent to which the model conforms to pathway concepts previously defined by MSigDB versus being free to define latent variables in whatever way best encodes the transcriptome in a reduced representation. Fine-tuning this hyperparameter enables control over the tension between direct biological interpretability and the ability to deviate from canon or find new patterns.

The transcriptome data were prepared as input into the autoencoders in three variations. In the first input variation, the data were fed into the model on the …

The loss function for the simpleAE is the mean squared error (MSE) between the decoder output and the original input values. For the VAE, the loss function has two terms: (i) the reconstruction loss, also MSE, and (ii) the KL divergence between the latent distribution and the prior.

The preparation of pathway-informed priors

Pathway-informed prior distributions were generated for each sample in the form of (µ) and (σ²) parameters using a bootstrapping procedure. We defined µ as the average expression level of the genes within each pathway definition, and σ² as the variance across bootstrapping iterations. Pathway definitions were taken from the MSigDB Hallmark collection [35].

Loss function for biological priors

Generally, variational autoencoders take the form found in Fig 1 and have been previously described in detail [27]. In short, it has been established that neural network training can perform variational inference when the loss function takes the form:

$$\mathcal{L}(\theta, \phi; x) = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{KL}} = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,p(z)\right)$$

Here, x is a vector of expression values for a sample in the set of all samples, X. The generative model p_θ(x|z) is learned, where z is a vector of latent variables with a prior p(z), such that z can generate the observed data x.

Fig 1. A: An overview of the system architecture with different input levels. A pathway-derived prior is generated on the transcript or gene level, as described in the following sections. The training and post hoc analyses are repeated with input on the transcript, gene, and community levels. B: As a latent variable framework, the model assumes that latent variables (Z) are determinants of the measured data (X). To learn Z, p(Z|X) is approximated by q(Z|X), modeled as the encoder portion of the network. The latent values represent probability distributions, implemented in the bottleneck layer as values for mu (µ) and sigma (σ). Finally, the decoder is conceptually equivalent to p(X|Z).
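
To make the prior preparation and loss construction described in this section more concrete, the following is a minimal sketch rather than the published implementation: it assumes an expression vector per sample, a dictionary of Hallmark gene sets, and an encoder that outputs µ and log σ² for the 50 latent dimensions, and it uses the closed-form KL divergence between two Gaussians given below. Names such as `expr`, `hallmark_sets`, and the exact bootstrap statistic are illustrative assumptions.

```python
# Hedged sketch of (i) bootstrapped pathway priors and (ii) a beta-weighted
# VAE loss with per-dimension Gaussian priors. Names and layout are assumptions.
import numpy as np
import torch
import torch.nn.functional as F

def pathway_priors(expr: np.ndarray, gene_index: dict, hallmark_sets: dict,
                   n_boot: int = 100, rng=np.random.default_rng(0)):
    """For one sample, derive a prior mean and variance per pathway.

    `expr` is a 1-D vector of expression values, `gene_index` maps gene symbols
    to positions in `expr`, and `hallmark_sets` maps pathway names to gene lists.
    The prior mean is the average expression of the pathway's genes; the prior
    variance is the variance of that mean across bootstrap resamples (assumed).
    """
    mu, var = [], []
    for genes in hallmark_sets.values():
        idx = [gene_index[g] for g in genes if g in gene_index]
        boot_means = [expr[rng.choice(idx, size=len(idx), replace=True)].mean()
                      for _ in range(n_boot)]
        mu.append(expr[idx].mean())
        var.append(np.var(boot_means))
    return np.array(mu), np.array(var)

def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """Closed-form KL( N(mu1, var1) || N(mu2, var2) ), summed over latent dims."""
    var1, var2 = logvar1.exp(), logvar2.exp()
    kl = 0.5 * (logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    return kl.sum(dim=1).mean()

def prior_vae_loss(x, x_recon, mu, logvar, prior_mu, prior_logvar, beta=1.0):
    """Reconstruction MSE plus beta-weighted KL to the pathway-informed prior."""
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = gaussian_kl(mu, logvar, prior_mu, prior_logvar)
    return recon + beta * kl
```

Setting `prior_mu` and `prior_logvar` to zeros recovers the standard unit-Gaussian VAE objective, while beta > 1 corresponds to the beta-weighted variants described below.
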
The reconstruction loss term L_recon can be measured by the mean squared error (MSE) between the input and the reconstructed output. In most VAE implementations, a Gaussian distribution is used for p(z) to make the KL calculation tractable. The assumption of a unit Gaussian prior, N(0, I), leads to the simplified expression:

$$\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}\left(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, I)\right) = -\frac{1}{2}\sum_{j}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

In order to implement pathway-informed priors with arbitrary parameter values, the KL divergence had to be implemented more generically for any two Gaussian distributions, measuring the distance between q_ϕ(z|x) and p(z), where q_ϕ(z|x) ∼ N(µ1, σ1²) and p(z) ∼ N(µ2, σ2²):

$$D_{\mathrm{KL}}\left(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\right) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

The models beta-simpleVAE and beta-priorVAE involve the addition of a hyperparameter β (β > 1) to put more weight on the L_KL term, as described in [28]:

$$\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \beta\,\mathcal{L}_{\mathrm{KL}}$$

… collapsing the inputs to the gene level and even further to 'representative genes' using a network-based clustering technique. A comparison of the choice of inputs can be found in S1 Fig.

To further explore reconstruction performance, we computed correlation coefficients between the input transcriptomes and the output transcriptomes. A complete pairwise correlation analysis across samples shows how similar input and output transcriptomes are to each other in the context of the natural variability across samples (Fig 2). Each model was able to reproduce reasonable output transcriptomes, which correlated with R between 0.8 and 0.97. In most cases, the closest pairwise correlations were between the input and output models, although this was not the case for many samples in the …

For the benchmarking task of classifying tissue types, eight target organs were selected based on the available samples. In this classification task, the simpleAE, simpleVAE, and priorVAE perform with an average precision above 80%. Beta-weighting leads to worse classification performance in both beta-simpleVAE and beta-priorVAE. However, beta-priorVAE performs much better in this validation task than the former (roughly 0.76 vs. less than 0.6).
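
The pairwise input-vs-output correlation check described above (Fig 2) can be sketched as follows; the array names and NumPy-based implementation are illustrative assumptions, not the authors' exact code.

```python
# Hedged sketch: pairwise correlation between input and reconstructed
# transcriptomes (cf. Fig 2). `x_in` and `x_out` are (n_samples x n_genes)
# arrays with rows aligned to the same samples.
import numpy as np

def input_output_correlation(x_in: np.ndarray, x_out: np.ndarray) -> np.ndarray:
    """Correlation matrix between all input and all output profiles.

    Entry [i, j] is the Pearson R between input sample i and output sample j;
    the diagonal holds each sample's correlation with its own reconstruction.
    """
    n = x_in.shape[0]
    # np.corrcoef treats each row as a variable; the off-diagonal block of
    # the stacked matrix holds the input-vs-output correlations.
    full = np.corrcoef(np.vstack([x_in, x_out]))
    return full[:n, n:]
```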

For the task of distinguishing adenocarcinoma from healthy controls, all five architectures perform comparably well, with precision between 98.6 and 100 percent.

Four architectures (all but beta-simpleVAE) can still distinguish the adenocarcinoma cell line, a subpopulation of non-small cell lung cancer (NSCLC), from the small cell lung cancer (SCLC) cell lines with 100 percent average precision. The performance of the beta-simpleVAE latent results as a classifier decreased slightly from 98.6% to 97.5%, and the standard deviation (sd) across the five-fold validation increased from 0.016 to 0.049.

Fig 5. The results of the prior variants (priorVAE and beta-priorVAE) on the classification of lung cancer subtypes (adenocarcinoma vs. small cell lung cancer). A and B: Biplots and principal component analysis (PCA) results of priorVAE and beta-priorVAE. C and D: Heatmaps of the latent results based on selected disease samples for priorVAE and beta-priorVAE. Data are scaled across the fifty latent dimensions. E and F: Two-sided t-test results of the pathway dimensions with the two disease statuses. The significance levels (p-values) are shown on the y-axis in -log10 form. The size of the points represents the correlation between the prior and the latent µ, with a larger size indicating a higher correlation. Dimensions with a negative correlation are colored red. G: The average precision score of multivariate logistic regression models trained on the latent representations of the models for the task of adenocarcinoma vs. SCLC classification, showing that each model type is well suited for this classification task.
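
To make the benchmarking procedure concrete, the following is a minimal sketch, not the authors' exact pipeline, of evaluating a latent representation with multivariate logistic regression under five-fold cross-validation and average precision, as used in the organ and disease classification tasks. The variable names (`latent_mu`, `labels`) and the scikit-learn-based implementation are assumptions for illustration.

```python
# Hedged sketch: benchmarking a 50-dimensional latent representation with
# multivariate logistic regression, five-fold CV, and average precision.
# `latent_mu` is (n_samples x 50); `labels` is a binary phenotype vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import average_precision_score

def benchmark_latent(latent_mu: np.ndarray, labels: np.ndarray, n_splits: int = 5):
    """Return mean and sd of average precision across stratified folds."""
    scores = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(latent_mu, labels):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(latent_mu[train_idx], labels[train_idx])
        # Probability of the positive class on held-out samples
        proba = clf.predict_proba(latent_mu[test_idx])[:, 1]
        scores.append(average_precision_score(labels[test_idx], proba))
    return float(np.mean(scores)), float(np.std(scores))
```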

Influence of priors on learned latent representation
The goal of bringing pathway-derived priors into the model is to produce latent representations that both accurately represent the full transcriptome and directly correspond to recognizable biological concepts. However, these goals are at odds, as we report in Fig 6 and S2 Fig. In the case of the priorVAE, the latent variable for allograft rejection does retain a high correlation with the prior scores (R = 0.82). However, all 49 other pathways show a correlation weaker than 0.5, and only six pathways are to some extent correlated with the prior, even with the far less stringent threshold of 0.4; 21 out of 50 pathways correlate negatively with the prior scores. In contrast, the latent variables of the beta-priorVAE retain their connection to their pathways, with 48 of 50 pathways correlating higher than 0.6. Fig 6C indicates that the beta-priorVAE provides latent values with a high level of semantic meaning, providing a direct means for interpreting complex transcriptome data sets.

… (Fig 4). However, for beta-priorVAE, interferon alpha response and MYC targets v2 are the main distinguishing features. In a direct comparison of adenocarcinoma and SCLC, the biplots show that angiogenesis and spermatogenesis are the two pathways with the highest loadings on the first principal component (PC) in priorVAE, and MYC targets v2 and spermatogenesis in the case of beta-priorVAE.
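
The post hoc interpretation steps summarized above, i.e., correlating each latent µ with its pathway prior and testing each latent dimension for separation between two disease groups (as in Fig 5E/F and Fig 6), can be sketched as follows. The array names and the SciPy/NumPy-based implementation are illustrative assumptions rather than the authors' exact code.

```python
# Hedged sketch: correlating latent dimensions with their pathway priors and
# testing each dimension for separation between two phenotype groups.
# `latent_mu` and `prior_mu` are (n_samples x 50) arrays aligned to the same
# Hallmark pathway order; `group` is a boolean mask (e.g. adenocarcinoma vs. SCLC).
import numpy as np
from scipy import stats

def interpret_latent(latent_mu, prior_mu, group, pathway_names):
    results = []
    for j, name in enumerate(pathway_names):
        # How well does latent dimension j still track its pathway prior?
        r, _ = stats.pearsonr(latent_mu[:, j], prior_mu[:, j])
        # Does dimension j separate the two disease statuses (two-sided t-test)?
        t, p = stats.ttest_ind(latent_mu[group, j], latent_mu[~group, j])
        results.append((name, r, -np.log10(p)))
    # Sort by significance, as in the -log10(p) panels of Fig 5
    return sorted(results, key=lambda x: x[2], reverse=True)
```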

These differences between priorVAE and beta-priorVAE are explained to a large extent by the fact that the beta-priorVAE produces latent variables which track more closely with the pathway-based priors (Fig 6). A direct interpretation of the involvement of a given pathway is only possible when that latent feature is highly correlated with the prior distributions. In the case of adenocarcinoma vs. small cell, the most distinguishing features, glycolysis and angiogenesis, have only weak correlations with their priors: R = 0.043 and -0.075, respectively. In contrast, for the beta-priorVAE, the most statistically significant feature distinguishing adenocarcinoma from the small cells is coagulation, a feature that is also highly correlated with its prior (R = 0.96). Therefore, it is only for …

The primary goal of utilizing pathway-based priors in the priorVAE and beta-priorVAE models was to generate a latent space that would be immediately interpretable to a biologist, because the model describes a transcriptome in terms of features that are familiar to a biologist. However, we have observed that latent features do not always retain the identity of their associated priors. In the case of the priorVAE (i.e., beta = 1), the model has substantial freedom to deviate from the pathway priors and, in fact, does so for many features. However, by boosting the requirement to adhere to the pathway concepts with the beta hyperparameter, an immediate biological interpretation of the latent space is achievable (Fig 6C). A notable exception is the case of spermatogenesis, which was the feature with the lowest correlation with its prior. This indicates that the model does require some freedom to discover major sources of variation. … There may also be technical sources of variation that need to be accounted for and that would not be represented in a pathway database. The presence of unanticipated sources of variation indicates that an interesting future direction for this research would be to include a few "wild-card" nodes in the latent space, with unit-Gaussian priors that are not driven by pathway data. This modification would allow the model to account for "unexpected" sources of variation while utilizing prior pathway information when possible. The burden of interpreting wild-card nodes would be placed on additional post hoc analysis.

The beta-priorVAE with the current settings shows an overall satisfying correlation between the latent variables and the prior scores, which enables an interpretation of the biological pathways involved in the chosen vignette. Based on the t-test results (Fig 5), coagulation is the most significant differentiator between the two lung cancer phenotypes (adenocarcinoma as the representative NSCLC, and SCLC). While coagulation function is associated with prognosis in NSCLC patients, as described in [39,40], both [41] and [42] reported an absence of a similar correlation between coagulation and SCLC prognosis. The literature is therefore consistent with the latent representation in that the concept of coagulation is a differentiating feature between the two diseases. Thus, the evaluation supports the feasibility of such an architecture in making transcriptomes more intuitively transparent and interpretable.

Although our primary motivation for including priors in our VAE was to make the latent space directly interpretable, the traditional motivation for including priors is to increase model accuracy. For most of our benchmarks, model accuracy generally decreased when we increased the emphasis on priors via the beta parameter. This implies that the reconstruction portion of the loss function, rather than the KL divergence term, is the main driver of performance in these benchmarks. However, an interesting point of comparison in our experiments is between the beta-simpleVAE and the beta-priorVAE, which place the same emphasis on priors (i.e., the same betas), but where the latter model incorporates prior biological knowledge. Table 2 and S3 Fig show an increase in performance when using a biological prior vs. a unit Gaussian prior, indicating that, in this local comparison, prior biological information can be beneficial.

We have provided evidence that the key sources of biological variation are captured in the latent space. At the primary level, the successful reconstruction, as demonstrated by the reconstruction loss and the high correlation coefficients between inputs and outputs, indicates that the latent representations are reliable. This is further supported by the fact that cancer types and tissue types can be distinguished using only the latent features. However, the performance for classifying tissues is surprisingly mediocre. The question is whether this is a limitation of the information found in the latent representation or an inadequate classification procedure. The latter scenario is supported by the high reconstruction accuracy and the fact that the multivariate logistic regression may be inadequate without proper feature selection.

The training and post hoc experiments demonstrate that (i) autoencoder models can find simplified representations of transcriptomes that still retain biological information, (ii) using pathway-derived priors, we can encourage the models to find latent representations that still adhere to concepts that are familiar to biologists, and (iii) …

The source code for this paper is available on GitHub at https://github.com/BinLiu9205/deepRNA_autoencoder.git and on figshare at https://figshare.com/articles/software/deepRNA_autoencoder/22227217.