Sequential compression of gene



Abstract:

Background
Unsupervised compression algorithms applied to gene expression data extract latent, or hidden, signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent dimensionality. In practice, most researchers select a single algorithm and latent dimensionality. We sought to determine the extent to which using multiple dimensionalities across ensemble compression models improves biological representations.

Results
In supervised machine learning tasks, compressed features optimized predictions at different latent dimensionalities.

Conclusions
There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using feature ensembles from different compression models across latent space dimensionalities optimizes biological representations.

features that discriminate cancer subtypes and drug response [8,9]. Additional latent variable approaches have been used to detect and remove technical artifacts, including batch effects

Figure 1: Overview of the BioBombe approach. We implemented BioBombe on three datasets using five different algorithms. We sequentially compressed input data into various latent dimensionalities. We calculated various metrics that describe different benefits and trade-offs of the algorithms. Lastly, we implemented a network projection approach to interpret the compressed latent features. We used MSigDB collections and xCell gene sets to interpret compressed features.

Assessing compression algorithm reconstruction
Reconstruction cost, a measurement of the difference between the input and output matrices, is often used to describe the ability of compression models to capture fundamental processes in latent space features that recapitulate the original input data. We tracked the reconstruction cost for the training and testing data partitions for all datasets, algorithms, latent dimensionalities, and random initializations. As expected, we observed lower reconstruction costs in models trained with real data and with higher latent dimensionalities.

Evaluating model stability and similarity within and across latent dimensionalities

We applied singular vector canonical correlation analysis (SVCCA) to algorithm weight matrices to assess model stability within algorithm initializations, and to determine model similarity between algorithms [21]. Briefly, SVCCA calculates similarity between two compression algorithm weight matrices by learning appropriate linear transformations and iteratively matching the highest-correlating features. Training with TCGA data, we observed highly stable models within algorithms and within all latent dimensionalities for PCA, ICA, and NMF (along the matrix diagonal in Fig. 2a). VAE models were also largely stable, with some decay at higher latent dimensionalities. However, DAE models were unstable, particularly at low latent dimensionalities (Fig. 2a). We also compared similarity across algorithms. Because PCA and ICA are rotations of one another, we used their high similarity as a positive control for SVCCA estimates. NMF was also highly similar to PCA and ICA, particularly at low latent dimensionalities (Fig. 2a). VAE models were more similar to PCA, ICA, and NMF than DAE models, particularly at low latent dimensionalities, and the instability patterns within DAE models also led to large differences across algorithms (Fig. 2a).
We observed similar patterns in GTEx and TARGET data, despite TARGET containing only about 700 samples (Additional File 1: Figure S4).

The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/573782

We also used SVCCA to compare the similarity of weight matrices across latent dimensionalities. Both PCA and ICA found highly similar solutions (Fig. 2b). We observed similar patterns in GTEx and TARGET data (Additional File 1: Figure S5).

initializations, algorithms, and latent dimensionalities. We optimally identified this phenotype in higher latent dimensionalities, particularly in NMF and VAE models (Fig. 3a). The top feature separating GTEx males and females was NMF feature 111 in k = 200 (t = 44.5, p = 7.3 x 10^-176) (Fig. 3b). We examined the genes that contributed with high weight to this feature and found that only three genes had substantial influence. These three genes all had high positive weights and were encoded on the Y chromosome. We performed the same approach using BioBombe features to identify sex features in TCGA test data (Fig. 3c). The top latent dimensionality identified was not consistent across algorithms. The top feature distinguishing TCGA males and females was ICA feature 53 in the k = 90 model (t = 4.9, p = 2.0 x 10^-6) (Fig. 3d). The separation was not as strong using the more complex TCGA data, but the top 10 gene weights were all encoded on the X chromosome.

We also tested the ability of features derived from the BioBombe approach to distinguish MYCN amplification in neuroblastoma (NBL) tumors.
MYCN amplification is a biomarker associated with poor prognosis in NBL patients [24]. We performed a two-tailed t-test comparing MYCN-amplified vs. non-amplified NBL tumors for each of the latent features derived from the full set of TARGET samples. Each algorithm discovered optimal signal at various latent dimensionalities, but the top-scoring features were generally identified in VAE and NMF models with large latent space dimensionalities (Fig. 3e). Although there were some potentially mischaracterized samples, feature 12 in VAE k = 50 robustly separated MYCN amplification status in NBL tumors (t = -18.5, p = 6.6 x 10^-38) (Fig. 3f). This feature also distinguished MYCN amplification status in NBL cell lines [25] that were not previously used in training the compression model or for feature selection (t = -3.2, p = 4.2 x 10^-3) (Fig. 3g).

algorithms, and initializations. We applied a network projection approach to all compression algorithm weight matrices to determine gene set coverage. Briefly, we projected all compressed features onto a gene set network and assigned to each compressed feature the gene sets with the highest enrichment that passed an adjusted statistical significance threshold (see Methods for more details). We tracked coverage of three MSigDB gene set collections representing transcription factor (TF) targets, cancer modules, and Reactome pathways across latent dimensionalities in TCGA data (Fig. 4). In all cases, we observed higher gene set coverage in models with larger latent dimensionalities. Considering individual models, we observed high coverage in PCA, ICA, and NMF. In particular, ICA outperformed all other algorithms (Fig. 4a). However, while these methods showed the highest coverage, the features identified had relatively low enrichment scores compared to AE models (Additional File 1: Figure S6).
Aggregating all five random initializations into ensemble models, we observed substantial coverage increases, especially for AEs (Fig. 4b). When considering all models combined (forming an ensemble of algorithm ensembles) within latent dimensionalities, we observed substantially increased coverage of all gene sets. However, most of the unique gene sets were contributed by the AE models (Fig. 4c). Lastly, when we aggregated all BioBombe features across all algorithms and all latent dimensionalities together into a single model, we observed the highest gene set coverage (Fig. 4c). These patterns were consistent across other gene set collections and datasets (Additional File 1: Figure S7). In general, while models compressed with larger latent space dimensionalities had higher gene set coverage, many individual gene sets were captured with the highest enrichment in models with low and intermediate dimensionalities (Additional File 1: Figure S8). These results did not reveal a best method or dimensionality: various biological representations are best discovered by using various compression algorithms with various latent space dimensionalities.

Observing the latent dimensionality of specific tissue and cell type signatures

We measured the Pearson correlation between all samples' gene expression input and reconstructed output. As expected, we observed increased mean correlation and decreased variance as the latent dimensionality increased in TCGA data (Fig. 5a). We also observed similar patterns in GTEx and TARGET data (Additional File 1: Figure S9). Across all datasets, in randomly permuted data, we observed correlations near zero (Additional File 1: Figure S9). The correlation with real data was not consistent across all algorithms, as PCA, ICA, and NMF generally outperformed the AE models.

We tracked correlation differences across latent dimensionalities to determine the dimensionality at which specific sample types are initially detected. Most cancer types, including breast invasive carcinoma (BRCA) and colon adenocarcinoma (COAD), displayed relatively gradual increases in sample correlation as the latent dimensionality increased (Fig. 5b). However, in other cancer types, such as low grade glioma (LGG), pheochromocytoma and paraganglioma (PCPG), and acute myeloid leukemia (LAML), we observed large correlation gains with a single increase in latent dimensionality (Fig. 5c). We also observed similar performance spikes in GTEx data for several tissues, including liver, pancreas, and blood (Fig. 5d). This sudden and rapid increase in correlation in specific tissues occurred at different latent dimensionalities for different algorithms, but was consistent across algorithm initializations.
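The per-sample correlation metric described above can be sketched as follows. This is a minimal sketch assuming a scikit-learn-style model that exposes transform and inverse_transform; the function name is ours, not from the BioBombe codebase.

```python
import numpy as np

def sample_reconstruction_correlations(X, model):
    """Per-sample Pearson correlation between each input expression profile
    and its reconstruction through the compression bottleneck.
    Sketch only: assumes a model with transform/inverse_transform."""
    X_hat = model.inverse_transform(model.transform(X))
    # Center each sample (row) before correlating
    Xc = X - X.mean(axis=1, keepdims=True)
    Hc = X_hat - X_hat.mean(axis=1, keepdims=True)
    num = (Xc * Hc).sum(axis=1)
    den = np.sqrt((Xc ** 2).sum(axis=1) * (Hc ** 2).sum(axis=1))
    return num / den  # one correlation per sample
```

Tracking this vector across latent dimensionalities reproduces the per-tissue correlation curves described in this section.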
To determine if this rapid increase was a result of models learning specific biological representations, or if it represented a technical artifact, we more closely examined the sharp increase in GTEx blood tissue correlation between latent space dimensionalities 2 and 3 in VAE models (see Fig. 5d). We hypothesized that a difference in reconstruction for a specific tissue at such a low dimensionality could be driven by a change in the cell types captured by the model. We applied network projection of xCell gene sets to all compressed features in both VAE models. xCell gene sets represent computationally derived cell type signatures [27]. The top features identified for the VAE k = 2 model included skeletal muscle, keratinocyte, and neuronal gene sets (Fig. 6a). Skeletal muscle was the most significant gene set identified, likely because it is the tissue with the most samples in GTEx. Similar gene sets were enriched in the k = 3 model, but we also observed enrichment for a specific neutrophil gene set ("Neutrophils_HPCA_2") (Fig. 6a). Neutrophils represent 50% of all blood cell types, which may explain the increased correlation in blood tissue observed in VAE k = 3 models. The features implicated using the network projection approach were similar to those from an overrepresentation analysis using high-weight genes in both tails of the VAE k = 3 feature (Additional File 1: Figure S10).

We also calculated the mean absolute value z-scores for xCell gene sets in all compression features for both VAE models with k = 2 and k = 3 dimensionalities (Fig. 6b). Again, we observed skeletal muscle, keratinocyte, and neuronal gene sets to be enriched in both models.
However, we also observed a cluster of monocyte gene sets (including "Monocytes_FANTOM_2") with enrichment in k = 3, but low enrichment in k = 2 (Fig. 6b). Monocytes are also important cell types found in blood, and it is probable these signatures also contributed to the increased correlation for the reconstructed blood samples in VAE k = 3 models. We provide the full list of xCell gene set genes for the neutrophil and monocyte gene sets that intersected with the GTEx data in Additional File 3.

We scanned all other algorithms and latent dimensionalities to identify other compression features with high enrichment scores in the "Neutrophils_HPCA_2" (Fig. 6c) and "Monocytes_FANTOM_2" (Fig. 6d) gene sets. We observed stronger enrichment of the "Neutrophils_HPCA_2" gene set in AE models compared to PCA, ICA, and NMF, especially at lower latent dimensionalities. We observed the highest score for the "Neutrophils_HPCA_2" gene set at k = 14 in VAE models (Fig. 6c). The top VAE feature at k = 14 correlated strongly with the VAE feature learned at k = 3 (Additional File 1: Figure S10). Conversely, PCA, ICA, and NMF identified the "Monocytes_FANTOM_2" signature with higher enrichment than the AE models (Fig. 6d). We observed a performance spike at k = 7 for both PCA and NMF models, but the highest enrichment for "Monocytes_FANTOM_2" occurred at k = 200 in NMF models.

Validating GTEx neutrophil and monocyte signatures in external datasets
We downloaded a processed gene expression dataset (GSE103706) in which two treatments were applied to induce neutrophil differentiation in two leukemia cell lines [29]. We hypothesized that projecting the dataset onto the "Neutrophils_HPCA_2" signature would reveal differential scores in the treated cell lines. We observed large differences in sample activations of treated vs. untreated cell lines in the top neutrophil signature (VAE k = 14) (Fig. 6e). We also tested the "Monocytes_FANTOM_2" signature on a different publicly available dataset (GSE24759) measuring gene expression of isolated cell types undergoing hematopoiesis [30]. We observed increased scores for the isolated monocyte cell population (MONO2) and relatively low scores for several other cell types for top VAE features (Fig. 6f).

We applied the top signatures for the neutrophil and monocyte gene sets to each external dataset (see Fig. 6c, d). We observed variable enrichment patterns across different algorithms and latent dimensionalities (Additional File 1: Figure S11a). These separation patterns were associated with network projection scores in NMF models, but were not consistent with other algorithms (Additional File 1: Figure S11b). Taken together, in this analysis we determined that 1) adding a single latent dimensionality that captured neutrophil and monocyte signatures improved signal detection in GTEx blood, 2) these gene expression signatures are enhanced at different latent dimensionalities and by different algorithms, and 3) these signatures generalized to external datasets that were not encountered during model training.

Using BioBombe features in supervised learning applications
We used BioBombe compressed features in two supervised machine learning tasks. First, we trained logistic regression models using compressed BioBombe features from individual model iterations as input to predict each of the 33 different TCGA cancer types. Nearly all cancer types could be predicted with high precision and recall (Additional File 1: Figure S12). We observed multiple performance spikes at varying latent dimensionalities for different cancer types and algorithms, which typically occurred at small latent dimensionalities (Fig. 7a). Next, we input BioBombe features into the supervised classifier to predict samples with alterations in the top 50 most mutated genes in TCGA (Additional File 1: Figure S13). We focused on predicting four cancer genes and one negative control: TP53, PTEN, PIK3CA, KRAS, and TTN (Fig. 7b). TTN is a particularly large gene, is associated with a high passenger mutation burden, and should provide no predictive signal [31]. As expected, we did not observe any signal in predicting TTN (Fig. 7b). Again, we observed performance increases at varying latent dimensionalities across algorithms. However, predictive signal for mutations occurred at higher latent dimensionalities compared to cancer types (Fig. 7c). Compared to features trained within algorithm and within iteration, an ensemble of five VAE models and an ensemble of five models representing one iteration of each algorithm (PCA, ICA, NMF, DAE, and VAE) identified cancer type and mutation status at earlier dimensionalities than single model iterations (Fig. 7c). We also tracked the logistic regression coefficients assigned to each compression feature. DAE models consistently produced sparse classifiers, and the VAE ensemble and model ensemble also induced high sparsity (Fig. 7d).
Lastly, we trained logistic regression classifiers using all 30,850 BioBombe features generated across iterations, algorithms, and latent dimensionalities. These models were sparse and high-performing, comparable to logistic regression models trained using raw features (Fig. 7e). Of all 30,850 compressed features in this model, only 317 were assigned non-zero weights (1.03%). We applied the network projection approach using Hallmark gene sets to interpret the

dimensionalities. However, if the analysis goal includes learning biological signatures to identify more subtle patterns in input samples, then there is not a single optimal latent dimensionality nor a single optimal algorithm. For example, though ICA and PCA represent rotations of each other, we found that the methods varied in their ability to capture specific biological signals in single features, which highlights the challenge of picking only a single algorithm. While compressing data into a single latent dimensionality will capture many biological signals, the "correct"

However, rather than using heuristics to select a biologically appropriate latent dimensionality, a researcher may instead elect to compress gene expression data into many different latent space dimensionalities to generate many different feature representations.

There are many limitations to our evaluation of compressing gene expression data into many latent dimensionalities across multiple methods. First, our approach takes a long time to run. We trained many different algorithms across many different latent dimensionalities and iterations, which requires substantial compute time (Additional File 1: Figure S14). However, because the many models are trained independently, this task can be parallelized. Additionally, we did not evaluate dimensionalities above k = 200. It is likely that many more signatures could be learned, possibly with even higher association strengths, at higher dimensionalities for certain biology. We also did not focus on detecting compressed features that represent technical artifacts, which has already been widely explored [47,48]. Moreover, we did not

feature distribution to be interpreted separately in algorithms that also learn negative weights. The weight distribution is dependent on the specific compression algorithm, and the same

Methods:
Transcriptomic compendia acquisition and processing

We downloaded transcriptomic datasets from publicly available resources.

Independently for each dataset (TCGA, GTEx, and TARGET), we performed the following procedure to train the compression algorithms. First, we randomly split the data into 90% training and 10% testing partitions. We balanced each partition by cancer type or tissue type, which meant that each split contained relatively equal representation of tissues. Before input into the compression algorithms, we transformed the gene expression values, gene by gene, to the [0, 1] range by subtracting the minimum value and dividing by the range of each specific gene. We applied this transform independently to the testing and training partitions. We selected this range because it is compatible with all of the algorithms. We used the training set to train each compression algorithm. We used the scikit-learn implementations of PCA, ICA, and NMF, and the Tybalt implementations of VAE and DAE [8,61].

After learning optimized compression models with the training data, we transformed the testing data using these models. We assessed performance metrics using both training and testing data to reduce bias. In addition to training with real data, we also trained all models with randomly permuted data. To permute the training data, we randomly shuffled the gene expression values for all genes independently. We also transformed testing partition data with models trained using randomly permuted data. Training with permuted data removes the correlational structure in the data and can help set performance metric baselines.
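The per-gene [0, 1] transform and the per-gene permutation baseline described above can be sketched as follows; the function names are ours, not from the BioBombe codebase.

```python
import numpy as np
import pandas as pd

def minmax_by_gene(df):
    """Scale each gene (column) to [0, 1] by subtracting its minimum and
    dividing by its range, as described in the text."""
    gene_min = df.min(axis=0)
    gene_range = df.max(axis=0) - gene_min
    # Guard against constant genes (zero range) to avoid division by zero
    return (df - gene_min) / gene_range.replace(0, 1)

def permute_by_gene(df, seed=0):
    """Shuffle expression values independently within each gene to destroy
    the correlational structure (the permuted-data baseline)."""
    rng = np.random.default_rng(seed)
    return df.apply(lambda col: rng.permutation(col.values), axis=0)
```

As in the paper's procedure, the transform would be fit and applied independently to the training and testing partitions.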
One of our goals was to assess differences in performance and biological signal detection across a range of latent dimensionalities (k). To this end, we trained all algorithms with various k dimensionalities, including k = 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 125, 150, and 200, for a total of 28 different dimensionalities. All of these models were trained independently. Lastly, for each k dimensionality we trained five different models initialized with five different random seeds. In total, considering the three datasets, five algorithms, randomly permuted training data, all 28 k dimensionalities, and five initializations, we trained 4,200 different compression models (Additional File 2: Figure S1). Therefore, in total, we generated 185,100 different compression features.
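The training grid over dimensionalities and seeds can be sketched for the scikit-learn algorithms as follows; the VAE and DAE models (Tybalt) would be trained analogously, and the function and variable names are ours.

```python
from sklearn.decomposition import PCA, FastICA, NMF

# The 28 latent dimensionalities used in the paper
KS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30,
      35, 40, 45, 50, 60, 70, 80, 90, 100, 125, 150, 200]
SEEDS = range(5)  # five random initializations per dimensionality

def train_models(X_train, ks=KS, seeds=SEEDS):
    """Fit each scikit-learn algorithm at every (k, seed) combination.
    Each model is trained independently, so this loop parallelizes trivially."""
    models = {}
    for k in ks:
        for seed in seeds:
            for name, cls in [("pca", PCA), ("ica", FastICA), ("nmf", NMF)]:
                models[(name, k, seed)] = cls(
                    n_components=k, random_state=seed).fit(X_train)
    return models
```

Note that NMF requires non-negative input, which the [0, 1] per-gene transform guarantees.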

Evaluating compression algorithm performance
We evaluated all compression algorithms on three main tasks: reconstruction, sample correlation, and weight matrix stability. First, we evaluated how well the input data are reconstructed after passing through the bottleneck layer. Because the input data were transformed to the range between 0 and 1, we used binary cross entropy to measure the difference between algorithm input and output as a measure of reconstruction cost. The lower the reconstruction cost, the higher the fidelity of the reconstruction, and therefore the higher the proportion of signal captured in the latent space features. We also assessed the Pearson correlation of all samples, comparing input to reconstructed output. This value is similar to reconstruction cost and can be quickly tracked at an individual sample level. Lastly, we used singular vector canonical correlation analysis (SVCCA) to determine model stability within, and model similarity between, algorithms and across latent dimensionalities [21]. The SVCCA method consists of two distinct steps. First, singular value decomposition (SVD) is performed on the two input weight matrices; the singular values that combine to reconstruct 98% of the signal in the data are retained. Next, each SVD-transformed weight matrix is input into a canonical correlation analysis (CCA). CCA aligns different features in the weight matrices based on maximal correlation after learning a series of linear transformations. Taken together, SVCCA outputs a single metric comparing two input weight matrices that represents stability across model initializations and average similarity of two different models. Because we used the weight matrices, the similarity describes signature discovery.
We used the distribution of SVCCA similarity measures across all pairwise algorithm initializations and latent dimensionalities to indicate model stability [21].
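The two SVCCA steps described above can be sketched as follows. This is a simplified sketch, not the reference SVCCA implementation: it uses a QR-based CCA, and the function names are ours.

```python
import numpy as np

def svcca_similarity(W1, W2, var_kept=0.98):
    """Sketch of SVCCA for two weight matrices (genes x latent features):
    reduce each with SVD, keeping the singular directions that reconstruct
    `var_kept` of the signal, then run CCA on the reduced matrices and
    average the canonical correlations."""
    def svd_reduce(W):
        U, s, _ = np.linalg.svd(W, full_matrices=False)
        keep = np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), var_kept) + 1
        return U[:, :keep] * s[:keep]

    def canonical_correlations(A, B):
        # Classic QR-based CCA: canonical correlations are the singular
        # values of Qa^T Qb after centering each matrix
        A = A - A.mean(axis=0)
        B = B - B.mean(axis=0)
        Qa, _ = np.linalg.qr(A)
        Qb, _ = np.linalg.qr(B)
        return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

    return float(np.mean(canonical_correlations(svd_reduce(W1), svd_reduce(W2))))
```

Because PCA and ICA weight matrices are rotations of one another, this metric should return values near 1 for that pair, matching the positive control described in the Results.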

Assessing gene expression signatures present in BioBombe features

We tested BioBombe sequentially compressed features for their ability to distinguish sample sex in GTEx and TCGA data, and MYCN amplification in TARGET NBL data. We tested all compression algorithms and latent space dimensionalities to determine the conditions under which these features were best captured. First, for the GTEx and TCGA sex analyses, we selected tissue types and cancer types that were balanced by sex, retaining tissues with male-to-female ratios between 0.5 and 1.5. We performed a two-tailed independent t-test assuming unequal variance comparing male and female samples, and NBL samples with and without MYCN amplification. We applied the t-test to all compression features identified across algorithms, initializations, and dimensionalities. Shown in the figures are the top-scoring features per latent space dimensionality and algorithm.

We applied the optimal MYCN signature learned in TARGET to an alternative dataset consisting of a series of publicly available NBL cell lines [25]. The data were processed using STAR, and we accessed the processed FPKM matrix from figshare [62]. We transformed the dataset with the identified signatures using the following operation:

D'_s = D_{n x g'} * S_{g'}

where D represents the respective RNA-seq data to transform, S represents the specific signature, g' represents the overlapping genes measured in both datasets, n represents samples, and D'_s represents the signature scores in the transformed dataset. Of the 8,000 genes measured in TARGET data, 7,653 were also measured in the external NBL cell line dataset (95.6%).

Using the sample activation scores for each of the top-scoring features for sample sex in TCGA and GTEx, and MYCN amplification in TARGET and the validation set, we performed two
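The signature projection and the unequal-variance t-test described above can be sketched as follows; the pandas-based layout and function names are our assumptions, not the paper's code.

```python
import numpy as np
import pandas as pd
from scipy import stats

def apply_signature(D, S):
    """Score samples with a compression feature: restrict to the genes g'
    measured in both the data D (samples x genes) and the signature S
    (indexed by gene), then take the dot product D[g'] . S[g']."""
    overlap = D.columns.intersection(S.index)
    return D[overlap] @ S.loc[overlap]

def feature_t_test(scores_a, scores_b):
    """Two-tailed independent t-test assuming unequal variance (Welch),
    e.g. comparing MYCN-amplified vs. non-amplified sample scores."""
    return stats.ttest_ind(scores_a, scores_b, equal_var=False)
```

Applying `apply_signature` with the top TARGET MYCN feature to the external cell line matrix corresponds to the D'_s operation above.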

Rapid interpretation of compressed gene expression data
Our goal was to quickly interpret the automatically generated compressed latent features learned by each unsupervised algorithm. To this end, we constructed gene set adjacency matrices for specific MSigDB or xCell gene set collections using the hetnetpy software. We then performed the following matrix multiplication against a given compression weight matrix to obtain a raw score for all gene sets for each latent feature:

H_{c x n} * W_{n x k} = G_{c x k}

where H represents the gene set adjacency matrix, c is the specific gene set collection, and n represents genes. W represents the specific compression algorithm weight matrix, which includes n genes and k latent space features. The output of this matrix multiplication, G, is represented by c gene sets and k latent dimensions. Through a single matrix multiplication, the matrix G tracks raw BioBombe scores.

Because certain hub genes are more likely to be implicated in gene sets, and longer gene sets will receive higher raw scores, we compared G to the distribution of permuted scores against all 10 permuted networks:

H^P_{1-10, c x n} * W_{n x k} = G_{p, c x k}

where H^P_{1-10} represents the adjacency matrices for all 10 permuted networks and G_p represents the distribution of scores for the same k features for all permutations. We calculated the z-score for all gene sets by latent feature (G_z-score). This score represents the BioBombe score. Other network-based gene set methods consider gene set influence based on network connectivity of gene set genes [53,54]. Instead, we used the latent feature weights derived from unsupervised compression algorithms as input, and the compiled gene set networks to assign biological function.
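Under the definitions above, the projection and z-scoring steps can be sketched as follows. This is an array-based sketch with our own function name; the construction of the permuted networks themselves is not shown.

```python
import numpy as np

def biobombe_scores(H, W, permuted_Hs):
    """Network projection sketch: raw scores G = H @ W (gene sets x latent
    features), compared against scores from permuted gene set networks to
    yield a per-(gene set, feature) z-score (the BioBombe score)."""
    G = H @ W                                      # c x k raw scores
    Gp = np.stack([Hp @ W for Hp in permuted_Hs])  # n_perm x c x k
    # Z-score each raw score against its permuted-network distribution;
    # small epsilon guards against zero permutation variance
    return (G - Gp.mean(axis=0)) / (Gp.std(axis=0) + 1e-12)
```

The z-scoring controls for hub genes and gene set size, as described above.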
We also compared the BioBombe network projection approach to overrepresentation analyses (ORA). We did not compare the approach to gene set enrichment analysis (GSEA) because evaluating single latent features required many permutations and did not scale to the many thousands of compressed features we examined. We implemented the ORA analysis using a Fisher's exact test. The background genes used in the test included only the genes represented in the specific gene set collection.
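A minimal sketch of this ORA setup with a Fisher's exact test, assuming simple gene lists; the function name and the 2x2 table construction are ours.

```python
from scipy import stats

def ora_fisher(feature_genes, gene_set, background):
    """Overrepresentation analysis via Fisher's exact test. As in the text,
    the background is restricted to genes in the gene set collection."""
    background = set(background)
    feature_genes = set(feature_genes) & background
    gene_set = set(gene_set) & background
    a = len(feature_genes & gene_set)   # in feature, in gene set
    b = len(feature_genes - gene_set)   # in feature, not in gene set
    c = len(gene_set - feature_genes)   # not in feature, in gene set
    d = len(background - feature_genes - gene_set)
    return stats.fisher_exact([[a, b], [c, d]], alternative="greater")
```

Here feature_genes would be the high-weight genes of a compressed feature (e.g. taken from both tails of the weight distribution).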

Downloading and processing publicly available expression data for neutrophil GTEx analysis
We used an external dataset to validate the neutrophil feature learned by compressing GTEx gene expression data into three latent dimensionalities. We observed that this feature contributed to improved reconstruction of blood tissue. To assess the performance of this neutrophil signature, we downloaded data from the Gene Expression Omnibus (GEO) with accession number GSE103706 and transformed these data along the compression feature weights:

D′(s×k) = D(s×g′) · W(g′×k)

where D represents the processed RNA-seq data from GSE103706. Of the 8,000 genes measured in W, 7,664 were also measured in D (95.8%). These 7,664 genes are represented by g′. All of the "Neutrophils_HPCA_2" signature genes were measured in W. D′ represents the GSE103706 data transformed along the specific compression feature. Each sample in D′ is then considered transformed by the specific signature captured in k. The specific genes representing "Neutrophils_HPCA_2" are provided in Additional File 3.

Downloading and processing publicly available expression data for monocyte GTEx analysis

We used an additional external dataset to validate the identified monocyte signature. We accessed processed data for the publicly available GEO dataset with accession number GSE24759.

Training supervised machine learning classifiers

We trained supervised learning classifiers using raw RNA-seq features and BioBombe-derived features. In general, we trained supervised machine learning models to predict cancer type from RNA-seq features in TCGA PanCanAtlas RNA-seq data. We implemented a logistic regression classifier with an elastic net penalty. The classifiers were controlled for mutation burden.
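The projection of external expression data onto a learned compression feature, described in the neutrophil and monocyte validation sections above, can be sketched with pandas; the matrices here are random stand-ins with the gene overlap reported in the text:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-ins: W is the compression weight matrix (genes x k),
# D is the processed external expression matrix (samples x genes).
genes_w = [f"gene_{i}" for i in range(8000)]
genes_d = genes_w[:7664]  # 7,664 of 8,000 genes overlap, as in the text
W = pd.DataFrame(rng.normal(size=(8000, 3)), index=genes_w)
D = pd.DataFrame(rng.normal(size=(10, 7664)), columns=genes_d)

# g': genes measured in both W and D.
g_prime = D.columns.intersection(W.index)

# Project each external sample onto the compressed features:
# D' = D[g'] . W[g'], aligning on the shared genes.
D_prime = D.loc[:, g_prime] @ W.loc[g_prime]

# Each row of D_prime scores one sample along each latent feature; the column
# corresponding to the signature of interest is then examined.
```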
More details about the specific implementation are described in Way et al. 2018 [67]. Here, we predicted all 33 cancer types using all 11,060 samples. These predictions were independent per cancer type, meaning that we trained models with the same input gene expression or BioBombe feature data but used 33 different status matrices.

We also trained models to predict gene alteration status for the 50 most mutated genes in the PanCanAtlas. These models were controlled for cancer type and mutation burden. We defined the status in this task using all non-silent mutations identified with a consensus mutation caller [68]. We also considered large copy number amplifications for oncogenes and deep copy number deletions for tumor suppressor genes, as previously defined [69]. We used the thresholded GISTIC2.0 calls for large copy amplifications (score = 2) and deep copy deletions (score = -2) in defining the status matrix [70]. For each gene alteration prediction, we removed samples with a hypermutator phenotype, defined as having log10 mutation counts greater than five standard deviations above the mean. For the mutation prediction task, we also excluded certain cancer types from training: we omitted cancer types if they had less than 5% or more than 95% representation of samples with the given gene alteration, and the positive and negative sets were each required to include at least 15 samples. We filtered out cancer types in this manner to prevent the classifiers from artificially detecting differences induced by unbalanced training sets.

We trained models with raw RNA-seq data subset to the 8,000 most variably expressed genes by median absolute deviation. The training data were the same training set used for the BioBombe procedure. We also trained models using all BioBombe compression matrices for each latent dimension, using real and permuted data. We combined
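The cancer-type filtering rules and the elastic net logistic regression described above can be sketched as follows; the data and hyperparameters are illustrative stand-ins, not the values used in the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def keep_cancer_type(y):
    """Apply the filters described in the text to one cancer type's status
    vector: positive-class proportion within [5%, 95%] and at least 15
    samples in each of the positive and negative sets."""
    pos = int(y.sum())
    neg = len(y) - pos
    prop = pos / len(y)
    return 0.05 <= prop <= 0.95 and pos >= 15 and neg >= 15

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))           # hypothetical BioBombe feature matrix
y = (rng.random(200) < 0.3).astype(int)  # hypothetical gene alteration status

if keep_cancer_type(y):
    # Logistic regression with an elastic net penalty (saga solver supports
    # the elasticnet penalty; l1_ratio and C here are illustrative).
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.15, C=1.0, max_iter=500)
    clf.fit(X, y)
```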
dimensionalities: k = 2, 4, 10, 16, 25, 50, 80, and 200. We conducted the time analysis using a CPU machine with an Intel Core i3 dual-core processor and 32 GB of DDR4 memory.

Declarations:
Ethics approval and consent to participate

The TCGA, GTEx, and TARGET data used are publicly available and their use was previously approved by their respective ethics committees.

Consent for publication
Not applicable.

Availability of data and material

All data used and results generated in this manuscript are publicly available. All results of the analysis can be viewed and downloaded from https://greenelab.github.io/BioBombe/. The analyzed data can be accessed in the following locations: TCGA data at https://gdc.cancer.gov/about-data/publications/pancanatlas, GTEx data at https://gtexportal.org/home/datasets, TARGET data at https://toil.xenahubs.net/download/target_RSEM_gene_fpkm.gz, the neutrophil validation data using Gene Expression Omnibus (GEO) accession number GSE103706 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103706), and the monocyte validation data using GEO accession number GSE24759 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24759). Software to reproduce the analyses and all results generated in this manuscript can be accessed at https://github.com/greenelab/biobombe. The software has also been archived in an additional publicly available repository at https://zenodo.org/record/3460539. Additionally, all BioBombe results are available in a series of versioned archives.

Acknowledgements

We would like to thank Jaclyn Taroni, Yoson Park, and Alexandra Lee for insightful discussions and code review. We also thank Jo Lynne Rokita and John Maris for insightful discussions regarding the neuroblastoma analysis.

References: