Sequential compression of gene expression across dimensionalities and methods reveals no single best method or dimensionality

Background Unsupervised compression algorithms applied to gene expression data extract latent, or hidden, signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically-appropriate latent dimensionality. In practice, most researchers select a single algorithm and latent dimensionality. We sought to determine the extent by which using multiple dimensionalities across ensemble compression models improves biological representations. Results We compressed gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We compressed these data into many latent dimensionalities ranging from 2 to 200. We observed various tradeoffs across latent dimensionalities and compression models. For example, we observed high model stability between principal components analysis (PCA), independent components analysis (ICA), and non-negative matrix factorization (NMF). We identified more unique biological signatures in ensembles of denoising autoencoder (DAE) and variational autoencoder (VAE) models in intermediate latent dimensionalities. However, we captured the most pathway-associated features using all compressed features across algorithms and dimensionalities. Optimized at different latent dimensionalities, compression models detect generalizable gene expression signatures representing sex, neuroblastoma MYCN amplification, and cell types. In two supervised machine learning tasks, compressed features optimized predictions at different latent dimensionalities. Conclusions There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using feature ensembles from different compression models across latent space dimensionalities optimizes biological representations.


Abstract: 23
Background 24 Unsupervised machine learning algorithms applied to gene expression data extract 25 latent, or hidden, signals representing technical and biological sources of variation. However, 26 these algorithms require a user to select a biologically-appropriate latent dimensionality. 27

Results 28
We compressed gene expression data from three large transcriptomic datasets 29 consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. Rather than 30 selecting a single latent dimensionality, we sequentially compressed these data into many 31 dimensions ranging from 2 to 200. We trained principal components analysis (PCA), 32 independent components analysis (ICA), non-negative matrix factorization (NMF), denoising 33 autoencoder (DAE), and variational autoencoder (VAE) models. We observed various tradeoffs 34 for each model. For example, we observed high model stability between PCA, ICA, and NMF 35 algorithms across latent dimensionalities. We identified more unique biological signatures in 36 DAE and VAE model ensembles in intermediate latent dimensionalities. However, we captured 37 the most pathway-associated features using all compressed features across algorithms, 38 ensembles, and dimensions. We also used multiple latent dimensionalities to optimize gene 39 expression signatures representing sample sex, neuroblastoma MYCN amplification, and 40 various blood cell types, which generalized to external datasets. In supervised machine learning 41 tasks, compressed features predicted cancer type and gene alteration status. In this setting, the 42 best performing supervised models used features from different dimensionalities and 43 compression algorithms indicating that there was no single best dimensionality or compression 44 algorithm. 45

Conclusions 46
Ensembles of features from different unsupervised algorithms discover biological 47 signatures in large transcriptomic datasets. To enhance biological signature discovery, rather 48 than compressing input data into a single pre-selected dimensionality, it is best to perform 49 compression on input data over many latent dimensionalities. 50 51

Introduction: 52
Dimensionality reduction algorithms compress input data into feature representations 53 that capture major sources of variation. Applied to gene expression data, compression 54 algorithms identify latent biological and technical processes. These processes reveal important 55 information about the samples and can help to generate hypotheses that are difficult or 56 impossible to observe in the original genomic space. For example, applying PCA to a large 57 cancer transcriptomic compendium determined the influence of copy number alterations in 58 gene expression measurements [1]. Applying ICA to transcriptome data aggregated gene 59 modules representing core pathways and hidden transcriptional programs [2,3]. Training NMF 60 models using bulk gene expression data estimated cell type proportion [4,5]. DAEs have 61 revealed latent signals characterizing oxygen exposure and transcription factor targets [6,7], 62 and VAEs have identified biologically relevant latent features discriminating cancer subtypes 63 and drug response [8,9]. Nevertheless, a major challenge to all compression applications is the 64 We compressed RNAseq data from TCGA, GTEx, and TARGET using PCA, ICA, NMF, DAE, 87 and VAE across 28 different latent dimensions (k) ranging from k = 2 to k = 200. We split each 88 dataset into 90% training and 10% test sets balanced by cancer type or tissue type and trained 89 models using only the training data. We used real and permuted data and initialized each 90 model five times per latent dimension resulting in a total of 4,200 different compression 91 models (Additional File 1: Figure S1). We evaluated hyperparameters for DAE and VAE models 92 across dimensions and trained models using optimized parameter settings (Additional File 2; 93 Additional File 1: Figure S2). See Fig. 1 for an outline of our approach. We provide full 94 BioBombe analysis results for all compression models across datasets for both real [13][14][15] and 95 permuted data [16][17][18] in both training and test sets as publicly available resources. 96 97 98 Figure 1: Overview of the BioBombe approach. We implemented BioBombe on three datasets 99 using five different algorithms. We sequentially compressed input data into various latent 100 dimensionalities. We calculated various metrics that describe different benefits and trade-offs of 101 the algorithms. Lastly, we implemented a network projection approach to interpret the 102 compressed latent features. We used MSigDB collections and xCell gene sets to interpret 103 compressed features. 104 105 Assessing compression algorithm reconstruction 106 Reconstruction cost, a measurement of the difference between the input and output 107 matrices, is often used to describe the ability of compression models to capture fundamental 108 processes in latent space features that recapitulate the original input data. We tracked the 109 reconstruction cost for the training and testing data partitions for all datasets, algorithms, 110 latent dimensions, and random initializations. As expected, we observed lower reconstruction 111 costs in models trained with real data and with higher latent dimensions (Additional File 1: 112 Figure S3). Because PCA and ICA are rotations of one another, we used the identical scores as a 113 positive control. All compression algorithms had similar reconstruction costs, with the highest 114 variability at low latent dimensions (Additional File 1: Figure S3). 115

116
Evaluating model stability and similarity within and across latent dimensions 117 We applied singular vector canonical correlation analysis (SVCCA) to algorithm weight 118 matrices to assess model stability within algorithm initializations, and to determine model 119 similarity between algorithms [19]. Briefly, SVCCA calculates similarity between two 120 compression algorithm weight matrices by learning appropriate linear transformations and 121 iteratively matching the highest correlating features. Training with TCGA data, we observed 122 highly stable models within algorithms and within all latent dimensionalities for PCA, ICA, NMF 123 (along the matrix diagonal in Fig 2a). VAE models were also largely stable, with some decay in 124 higher latent dimensions. However, DAE models were unstable, particularly at low latent 125 dimensions (Fig 2a). We also compared similarity across algorithms. Because PCA and ICA are 126 rotations of one another, we used the high stability as a positive control for SVCCA estimates. 127  upper triangle represents SVCCA applied to real gene expression data, while the lower triangle  134 represents permuted expression data. Both real and permuted data are plotted along the 135 diagonal. (b) Mean correlations of all iterations within algorithms but across k dimensions. SVCCA 136 will identify min(i, j) canonical vectors for latent dimensions ki and kj. The mean of all pairwise 137 correlations is shown for all combinations of k dimensions. 138 139 NMF was also highly similar to PCA and ICA, particularly at low latent dimensions (Fig. 2a). VAE 140 models were more similar to PCA, ICA, and NMF than DAE models, particularly at low latent 141 dimensions, and the instability patterns within DAE models also lead to large differences across 142 algorithms (Fig. 2a). We observed similar patterns in GTEx and TARGET data, despite TARGET 143 containing only about 700 samples (Additional File 1: Figure S4). 144 We also used SVCCA to compare the similarity of weight matrices across latent 145 dimensions. Both PCA and ICA found highly similar solutions across all dimensions (Fig. 2b) and latent dimensions. We optimally identified this phenotype in higher latent dimensions, 160 particularly in VAE and NMF models (Fig. 3a). The top feature separating GTEx males and 161 females was VAE feature 108 in k = 200 (t = 49.0, p = 2.7 x 10 -285 ) (Fig 3b). We performed the 162 same approach using BioBombe features in TCGA data. Whereas the largest models appeared 163 to capture sex optimally in GTEx data, intermediate latent dimensions best captured sex in 164 TCGA data (Fig. 3c). The top latent dimension identified was not consistent across algorithms. 165 The top feature distinguishing TCGA males and females was VAE feature 16 in the k = 20 model 166 (t = -13.9, p = 1.8 x 10 -40 ) (Fig. 3d). 167 We also tested the ability of BioBombe to distinguish MYCN amplification in 168 neuroblastoma (NBL) tumors. MYCN amplification is a biomarker associated with poor 169 prognosis in NBL patients [22]. Using latent features derived from the full TARGET data, we 170 performed a two-tailed t-test comparing MYCN amplified vs. MYCN not amplified NBL tumors. 171 Each algorithm discovered optimal signal at various latent dimensions, but the best feature was 172 identified in VAE models at k = 200 (Fig. 3e). Although there were some potentially 173 mischaracterized samples, feature 111 in VAE k = 200 robustly separated MYCN amplification 174 status in NBL tumors (t = 17.5, p = 3.0 x 10 -37 ) (Fig. 3f). This feature also distinguished MYCN 175 amplification status in NBL cell lines [23] that were previously not used for training by the 176 compression model or for feature selection (t = 2.9, p = 7.1 x 10 -3 ) (Fig. 3g). targets, cancer modules, and Reactome pathways across latent dimensions in TCGA data (Fig.  196   4). In all cases, we observed higher gene set coverage in models with larger latent 197 dimensionalities. Considering individual models, we observed high coverage in PCA, ICA, and 198 NMF. In particular, ICA outperformed all other algorithms (Fig. 4a). However, while these 199 methods showed the highest coverage, the features identified had relatively low enrichment 200 scores compared to AE models (Additional File 1: Figure S6). 201 Aggregating all five random initializations into ensemble models, we observed 202 substantial coverage increases, especially for AEs (Fig. 4b). VAE models had high coverage for all 203 gene sets in intermediate dimensions, while DAE improved in higher dimensions. However, at 204 the highest dimensions, ICA demonstrated the highest coverage. NMF consistently had the 205 highest enrichment scores, but the lowest coverage (Fig. 4b). When considering all models 206 combined (forming an ensemble of algorithm ensembles) within latent dimensionalities, we 207 observed substantially increased coverage of all gene sets. However, most of the unique gene 208 sets were contributed by the AE models (Fig. 4c). Lastly, when we aggregated all BioBombe 209 features across all algorithms and all latent dimensions together into a single model, we 210 observed the highest gene set coverage (Fig. 4c). These patterns were consistent across other 211 We measured the Pearson correlation between all samples' gene expression input and 231 reconstructed output. As expected, we observed increased mean correlation and decreased 232 variance as the latent dimensions increased in TCGA data ( Fig. 5a). We also observed similar 233 patterns in GTEx and TARGET data (Additional File 1: Figure S9). Across all datasets, in 234 randomly permuted data, we observed correlations near zero (Additional File 1: Figure S9). The 235 correlation with real data was not consistent across all algorithms as PCA, ICA, and NMF 236 generally outperformed the AE models. 237 We tracked correlation differences across latent dimensionalities to determine the 238 dimension at which specific sample types are initially detected. Most cancer types, including 239 breast invasive carcinoma (BRCA) and colon adenocarcinoma (COAD), displayed relatively 240 gradual increases in sample correlation as the latent dimensionality increased (Fig. 5b). 241 However, in other cancer types, such as low grade glioma (LGG), pheochromocytoma and 242 paraganglioma (PCPG), and acute myeloid leukemia (LAML), we observed large correlation 243 gains with a single increase in latent dimension (Fig. 5c). We also observed similar performance 244 spikes in GTEx data for several tissues including liver, pancreas, and blood (Fig. 5d). This sudden 245 and rapid increase in correlation in specific tissues occurred at different latent dimensions for 246 different algorithms, but was consistent across algorithm initializations. We more closely examined the sharp increase in GTEx blood tissue correlation between 255 latent space dimensions 2 and 3 in VAE models (See Fig. 5d). We hypothesized that a difference 256 in reconstruction for a specific tissue at such a low dimensionality could be driven by a change 257 in the cell types captured by the model. We applied network projection of xCell gene sets to all 258 compressed features in both VAE models. xCell gene sets represent computationally derived 259 cell type signatures [25]. The top features identified for the VAE k = 2 model included skeletal 260 muscle, keratinocyte, and neuronal gene sets (Fig. 6a). Skeletal muscle was the most significant 261 gene set identified likely because it the tissue with the most samples in GTEx. Similar gene sets 262 were enriched in the k = 3 model, but we also observed enrichment for a specific neutrophil 263 gene set ("Neutrophils_HPCA_2") ( Fig. 6a). Neutrophils represent 50% of all blood cell types, 264 which may explain the increased correlation in blood tissue observed in VAE k = 3 models. The 265 features implicated using the network projection approach were similar to an 266 overrepresentation analysis using high weight genes in both tails of the VAE k = 3 feature 267 (Additional File 1: Figure S10). 268 We also calculated the mean absolute value z scores for xCell gene sets in all 269 compression features for both VAE models with k = 2 and k = 3 dimensions (Fig. 6b). Again, we 270 observed skeletal muscle, keratinocytes, and neuronal gene sets to be enriched in both models. 271 However, we also observed a cluster of monocyte gene sets (including 272 "Monocytes_FANTOM_2") with enrichment in k = 3, but low enrichment in k = 2 (Fig. 6b). 273 Monocytes are also important cell types found in blood, and it is probable these signatures also 274 contributed to the increased correlation for the reconstructed blood samples in VAE k = 3 275 models. We provide the full list of xCell gene set genes for the neutrophil and monocyte gene 276 sets that intersected with the GTEx data in Additional File 3. We scanned all other algorithms and latent dimensions to identify other compression 291 features with high enrichment scores in the "Neutrophils_HPCA_2" (Fig. 6c) and 292 "Monocytes_FANTOM_2" gene sets (Fig. 6d). We observed stronger enrichment of the 293 "Neutrophil_HPCA_2" gene set in AE models compared to PCA, ICA, and NMF, especially at 294 lower latent dimensions. We observed the highest score for the "Neutrophil_HPCA_2" gene set 295 at k = 14 in VAE models (Fig. 6c). The top VAE feature at k = 14 correlated strongly with the VAE 296 feature learned at k = 3 (Additional File 1: Figure S10). Conversely, PCA, ICA, and NMF 297 identified the "Monocytes_FANTOM_2" signature with higher enrichment than the AE models 298 (Fig. 6d). We observed a performance spike at k = 7 for both PCA and NMF models, but the 299 highest enrichment for "Monocytes_FANTOM_2" occurred at k = 200 in NMF models. 300 301

Validating GTEx neutrophil and monocyte signatures in external datasets 302
We downloaded a processed gene expression dataset (GSE103706) that applied two 303 treatments to induce neutrophil differentiation in two leukemia cell lines [27]. We hypothesized 304 that projecting the dataset on the "Neutrophil_HPCA_2" signature would reveal differential 305 scores in the treated cell lines. We observed large differences in sample activations of treated 306 vs untreated cell lines in the top Neutrophil signature (VAE k = 14) (Fig. 6e). We also tested the 307 "Monocytes_FANTOM_2" signature on a different publicly available dataset (GSE24759) 308 measuring gene expression of isolated cell types undergoing hematopoiesis [28]. We observed 309 increased scores for isolated monocyte cell population (MONO2) and relatively low scores for 310 several other cell types for top VAE features (Fig. 6f). 311 We applied the top signatures for the neutrophil and monocyte gene sets to each 312 external dataset (see Fig. 6c, d). We observed variable enrichment patterns across different 313 algorithms and latent dimensionalities (Additional File 1: Figure S11a). These separation 314 patterns were associated with network projection scores in NMF models, but were not 315 consistent with other algorithms (Additional File 1: Figure S11b). Taken together, in this 316 analysis we determined that 1) adding a single latent dimension that captured Neutrophil and 317 Monocyte signatures improved signal detection in GTEx blood, 2) these gene expression 318 signatures are enhanced at different latent dimensionalities and by different algorithms, and 3) 319 these signatures generalized to external datasets that were not encountered during model 320 Nearly all cancer types could be predicted with high precision and recall (Additional File 1: 327 Figure S12). We observed multiple performance spikes at varying latent dimensionalities for 328 different cancer types and algorithms, which typically occurred in small latent dimensions (Fig.  329   7a). Next, we input BioBombe features into the supervised classifier to predict samples with 330 alterations in the top 50 most mutated genes in TCGA (Additional File 1: Figure S13). We 331 focused on predicting four cancer genes and one negative control; TP53, PTEN, PIK3CA, KRAS, 332 and TTN (Fig. 7b). TTN is a particularly large gene and is associated with a high passenger 333 mutation burden and should provide no predictive signal [29]. As expected, we did not observe 334 any signal in predicting TTN (Fig. 7b). Again, we observed performance increases at varying 335 latent dimensionalities across algorithms. However, predictive signal for mutations occurred at 336 higher latent dimensions compared to cancer types (Fig. 7c). Compared to features trained 337 within algorithm and within iteration, an ensemble of five VAE models and an ensemble of five 338 models representing one iteration of each algorithm (PCA, ICA, NMF, DAE, and VAE), identified 339 cancer type and mutation status in earlier dimensions compared to single model iterations (Fig  340   7c). We also tracked the logistic regression coefficients assigned to each compression feature. 341 DAE models consistently displayed sparse models, and the VAE ensemble and model ensemble 342 also induced high sparsity (Fig. 7d). 343 Lastly, we trained logistic regression classifiers using all 30,850 BioBombe features 344 generated across iterations, algorithms, and latent dimensions. These models were sparse and 345 high performing; comparable to logistic regression models trained using raw features (Fig. 7e). of the signal (Fig. 7f). 358

Discussion: 359
Our primary observation is that compressing complex gene expression data using 360 multiple latent dimensionalities and algorithms enhances biological signature discovery. Across 361 multiple latent dimensionalities, we identified optimal features to stratify sample sex, MYCN 362 amplification, blood cell types, cancer types, and mutation status. Furthermore, the complexity 363 of biological features was associated with the number of latent dimensions used. We predicted 364 gene mutation using models with high dimensionality, but we detected cancer type with high 365 accuracy using models with low dimensionality. In general, unsupervised learning algorithms 366 applied to gene expression data extract biological and technical signals present in input 367 samples. When applying these algorithms, researchers must determine how many latent 368 dimensions to compress their input data into and different studies can have a variety of goals. 369 For example, compression algorithms used for visualization can stratify sample groups based on 370 the largest sources of variation [30][31][32][33][34][35]. In visualization settings, selecting a small number of 371 latent dimensions is often best, and there is no need for sequential compression. However, if 372 the analysis goal includes learning biological signatures to identify more subtle patterns in input 373 samples, then there is not a single optimal latent dimensionality nor optimal algorithm. While 374 compressing data into a single latent dimension will capture many biological signals, the 375 "correct" dimension is not always clear, and several biological signatures may be better 376 revealed in alternative latent dimensions. 377 However, rather than using heuristics to select a biologically-appropriate latent dimension, a 408 researcher may instead elect to compress gene expression data into many different latent 409 space dimensionalities to generate many different feature representations. 410 There are many limitations to our approach and analysis. First, our approach takes a 411 long time to run. We are training many different algorithms across many different latent 412 dimensions and iterations, which requires a lot of compute time. However, because we are 413 training many models independently, this task can be parallelized. Additionally, we did not 414 evaluate dimensions above k = 200. It is likely that many more signatures can be learned, and 415 possibly with even higher association strengths in higher dimensions for certain biology. We 416 also do not have a mechanism to detect compressed features that represent technical artifacts. Independently for each dataset (TCGA, GTEx, and TARGET), we performed the following 496 procedure to train the compression algorithms. First, we randomly split data into 90% training 497 and 10% testing partitions. We balanced each partition by cancer type or tissue type, which 498 meant that each split contained relatively equal representation of tissues. Before input into the 499 compression algorithm, we transformed the gene expression values by gene to a range 500 between 0 and 1 independently for the testing and training partitions. We used the training set 501 to train each compression algorithm. We used the scikit-learn implementations of PCA, ICA, and 502 NMF, and the Tybalt implementations of VAE and DAE [8,54]. 503 After learning optimized compression models with the training data, we transformed 504 the testing data using these models. We assessed performance metrics using both training and 505 testing data to reduce bias. In addition to training with real data, we also trained all models 506 with randomly permuted data. To permute the training data, we randomly shuffled the gene 507 expression values for all genes independently. We also transformed testing partition data with 508 models trained using randomly permuted data. Training with permuted data removes the 509 correlational structure in the data and can help set performance metric baselines. 510 One of our goals was to assess differences in performance and biological signal 511 detection across a range of latent dimensionalities (k). To this end, we trained all algorithms 512 with various k dimensionalities including k = 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30, 513 35, 40, 45, 50, 60, 70, 80, 90, 100, 125, 150, and 200 for a total of 28 different dimensions. All of 514 these models were trained independently. Lastly, for each k dimension we trained five different 515 models initialized with five different random seeds. In total, considering the three datasets, five 516 algorithms, randomly permuted training data, all 28 k dimensions, and five initializations, we 517 trained 4,200 different compression models (Additional File 2: Figure S1). Therefore, in total, 518 we generated 185,100 different compression features. 519 520

Evaluating compression algorithm performance 521
We evaluated all compression algorithms on three main tasks: Reconstruction, sample 522 correlation, and weight matrix stability. First, we evaluated how well the input data is 523 reconstructed after passing through the bottleneck layer. Because the input data was 524 transformed to a distribution between 0 and 1, we used binary cross entropy to measure the 525 difference between algorithm input and output as a measure of reconstruction cost. The lower 526 the reconstruction cost, the higher fidelity reconstruction, and therefore the higher proportion 527 of signals captured in the latent space features. We also assessed the Pearson correlation of all 528 samples comparing input to reconstructed output. This value is similar to reconstruction and 529 can be quickly tracked at an individual sample level. Lastly, we used singular vector canonical 530 correlation analysis (SVCCA) to determine model stability within and model similarity between 531 algorithms and across latent dimensions [19]. The SVCCA method consisted of two distinct 532 steps. First, singular value decomposition (SVD) was performed on two input weight matrices. 533 The singular values that combined to reconstruct 98% of the signal in the data were retained. 534 Next, the SVD transformed weight matrix was input into a canonical correlation analysis (CCA). 535 CCA aligned different features in the weight matrix based on maximal correlation after learning 536 a series of linear transformations. Taken together, SVCCA outputs a single metric comparing 537 two input weight matrices that represents stability across model initializations and average 538 similarity of two different models. Because we used the weight matrices, the similarity 539 describes signature discovery. We use the distribution of SVCCA similarity measures across all 540 pairwise algorithm initializations and latent dimensionalities to indicate model stability [19]. 541 542 Using BioBombe as a signature discovery tool 543 We tested the ability of BioBombe sequentially compressed features to distinguish 544 sample sex in GTEx and TCGA data, and MYCN amplification in TARGET NBL data. We 545 performed a two-tailed independent t-test assuming equal variance comparing male and 546 female samples, and NBL samples with and without MYCN amplification. We applied the t-test 547 to all compression features identified across algorithms, initializations, and dimensions. Shown 548 in the figures are the top scoring feature per latent space dimension and algorithm. 549 We applied the optimal MYCN signature learned in TARGET to an alternative dataset 550 consisting of a series of publicly available NBL cell lines [23]. The data were processed using 551 STAR, and we accessed the processed FPKM matrix from figshare [55]. We transformed the 552 dataset with the identified signatures using the following operation: 553 "# $ * "# x ) = + , ) #

554
Where D represents the respective RNAseq data to transform, S represents the specific 555 signature, g' represents the overlapping genes measured in both datasets, n represents 556 samples, and D's represents the signature scores in the transformed dataset. Of the 8,000 genes 557 measured in TARGET data, 7,653 were also measured in external NBL cell line dataset (95.6%). Our goal was to quickly interpret the automatically generated compressed latent 590 features learned by each unsupervised algorithm. To this end, we constructed gene set 591 adjacency matrices with specific MSigDB or xCell gene set collections using hetio software. We 592 then performed the following matrix multiplication against a given compressed weight matrix 593 to obtain a raw score for all gene sets for each latent feature. 594 . x ) * ) x 0 = . x 0 595 Where H represents the gene set adjacency matrix, c is the specific gene set collection, and n 596 represents genes. W represents the specific compression algorithm weight matrix, which 597 includes n genes and k latent space features. The output of this matrix multiplication, G, is 598 represented by c gene sets and k latent dimensions. Through a single matrix multiplication, the 599 matrix G tracks raw BioBombe scores. 600 Because certain hub genes are more likely to be implicated in gene sets and longer gene 601 sets will receive higher raw scores, we compared G to the distribution of permuted scores 602 against all 10 permuted networks. ( 2 ) 605 Where HP 1-10 represents the adjacency matrices for all 10 permuted networks and Gp represents 606 the distribution of scores for the same k features for all permutations. We calculated the z 607 score for all gene sets by latent features (Gz-score). This score represents the BioBombe Score. 608 Other network-based gene set methods consider gene set influence based on network 609 connectivity of gene set genes [46,47]. Instead, we used the latent feature weights derived 610 from unsupervised compression algorithms as input, and the compiled gene set networks to 611 assign biological function. 612 We also compared the BioBombe network projection approach to overrepresentation 613 analyses (ORA). We did not compare the approach to gene set enrichment analysis (GSEA) 614 because evaluating single latent features required many permutations and did not scale to the 615 many thousands of compressed features we examined. We implemented ORA analysis using a 616 Fisher's Exact test. The background genes used in the test included only the genes represented 617 in the specific gene set collection. 618

619
Calculating gene set coverage of sequentially compressed gene expression data 620 We were interested in determining the proportion of gene sets within gene set 621 collections that were captured by the features derived from various compression algorithms. 622 We considered a gene set "captured" by a compression feature if it had the highest positive or 623 highest negative BioBombe z score compared to all other gene sets in that collection. We 624 converted BioBombe z scores into p values using the pnorm() R function using a two-tailed test. The PLB-985 cell line was previously identified as a subclone of HL-60, so we expect similar 644 signature activity between the two lines [58]. Gene expression of the two cell lines was 645 measured with and without neutrophil differentiation treatments. Though DMSO is frequently 646 used to solubilize compounds and act as an experimental control, it has been used to create 647 neutrophil-like cells [59], and the dataset we used was generated to compare this activity with 648 untreated and DMSO with Nutridoma [27]. We tested the hypothesis that our neutrophil 649 signature would distinguish the samples with and without neutrophil differentiation treatment. 650 We transformed external datasets with the following operation: 651 Where D represents the processed RNAseq data from GSE103706. Of 8,000 genes measured in 653 W, 7,664 were also measured in D (95.8%). These 7,664 genes are represented by g'. All of the 654 "Neutrophils_HPCA_2" signature genes were measured in W. D' represents the GSE103706 655 data transformed along the specific compression feature. Each sample in D' is then considered 656 transformed by the specific signature captured in k. The specific genes representing 657 "Neutrophils_HPCA_2" is provided in Additional File 3. 658 659 Downloading and processing publicly available expression data for monocyte GTEx analysis 660 We used an additional external dataset to validate the identified monocyte signature. We trained supervised machine learning models to predict cancer type from RNAseq 675 features in TCGA PanCanAtlas RNAseq data. We implemented a logistic regression classifier 676 with an elastic net penalty. The classifiers are controlled for mutation burden. More details 677 about the specific implementation are described in Way et al. 2018 [60]. Here, we predicted all 678 33 cancer types using all 11,060 samples. These predictions were independent per cancer type, 679 which meant that we trained models with the same input gene expression data, but used 33 680 different status matrices. 681 We also trained models to predict gene alteration status in the top 50 most mutated 682 genes in the PanCanAtlas. These models are controlled for cancer type and mutation burden. 683 We defined the status in this task using all non-silent mutations identified with a consensus 684 mutation caller [61]. We also considered large copy number amplifications for oncogenes and 685 deep copy number deletions for tumor suppressor genes as previously defined [62]. We used 686 the threshold GISTIC2.0 calls for large copy amplifications (score = 2) and deep copy deletions 687 (score = -2) in defining the status matrix [63]. For each gene alteration prediction, we removed 688 samples with a hypermutator phenotype, defined by having log10 mutation counts greater than 689 five standard deviations above the mean. For the mutation prediction task, we also did not 690 include certain cancer types in training. We omitted cancer types if they had less than 5% or 691 more than 95% representation of samples with the given gene alteration. The positive and 692 negative sets must have also included at least 15 samples. We filtered out cancer types in this 693 manner to avoid the classifiers from artificially detecting differences induced by unbalanced 694 training sets. 695 We trained models with raw RNAseq data subset by the top 8,000 most variably 696 expressed genes by median absolute deviation. The training data used was the same training 697 set used for the sequential compression procedure. We also trained models using all 698 compression matrices for each latent dimension, and using real and permuted data. We 699 combined compressed features together to form three different types of ensemble models. The 700 first type grouped all five iterations of VAE models per latent dimension to make predictions. 701 The second type grouped features of five different algorithms (PCA, ICA, NMF, DAE, VAE) of a 702 single iteration together to make predictions. The third ensemble aggregated all features 703 learned by all algorithms, all initializations, and across all latent dimensions, which included a 704 total of 30,850 features. In total, considering the 33 cancer types, 50 mutations, 28 latent 705 dimensions, ensemble models, raw RNAseq features, real and permuted data, and 5 706 initializations per compression, we trained and evaluated 32,868 different supervised models. 707 We optimized all of the models independently using 5-fold cross validation (CV). We 708 searched over a grid of elastic net mixing and alpha hyperparameters. The elastic net mixing 709 parameter represents the tradeoff between l1 and l2 penalties (where mixing = 0 represents an 710