Matrix factorization recovers consistent regulatory signals from disparate datasets

The availability of gene expression data has dramatically increased in recent years. This data deluge could result in detailed inference of underlying regulatory networks, but the diversity of experimental platforms and protocols introduces critical biases that could hinder scalable analysis of existing data. Here, we show that the underlying structure of the E. coli transcriptome, as determined by Independent Component Analysis (ICA), is conserved across multiple independent datasets, including both RNA-seq and microarray datasets. We also show that echoes of this structure remain in the proteome, accelerating biological discovery through multi-omics analysis. We subsequently combined five transcriptomics datasets into a large compendium containing over 800 expression profiles and discovered that its underlying ICA-based structure was still comparable to that of the individual datasets. ICA thus enables deep analysis of disparate data to uncover new insights that were not visible in the individual datasets.


Introduction
Publicly available datasets, such as the NCBI Gene Expression Omnibus (GEO) 1 and Array Express 2 , contain thousands of transcriptomics datasets that are often designed and 30 analyzed for a specific study. Historically, microarrays were the platform of choice for transcriptomic interrogation. Over the past decade, usage of RNA sequencing (RNA-seq) has surpassed microarrays due to its higher sensitivity and ability to detect new transcripts 3 . Public repositories for proteomics datasets have introduced additional reusable expression datasets 4,5 .

35
Multiple consortia have performed extensive comparisons of expression levels across different microarray and RNA-seq platforms [6][7][8] . These studies showed that absolute gene expression levels cannot be accurately measured by either expression profiling technique, whereas relative abundances are consistent across a wide range of transcriptomics platforms, with appropriate quality controls. However, transcript levels alone cannot predict protein 40 expression levels 9 To further complicate matters, batch effects and technical heterogeneity continue to present significant challenges to successful integration of omics datasets 10 .
Differential expression analysis is the most common analytical method applied to transcriptomics datasets. However, differential expression analysis is limited in dimensionality, 45 interpretability, and reproducibility; it can only be applied to pairs of experimental conditions, requires additional analysis to interpret large swaths of differentially expressed genes 11,12 , and is highly dependent on the quantification pipeline 13, 14 . Alternatively, machine learning methods, especially matrix factorization 15,16 , have provided new tools for extracting low-dimensional biological information from large omics data. 50 In particular, independent component analysis (ICA) has been shown to extract biologically significant gene sets from many transcriptomics datasets [17][18][19][20][21][22] and two proteomics datasets 23 . ICA outperformed 42 module detection methods in a comprehensive examination across 5 organisms 24 . Previously, we applied ICA to a high-quality Escherichia coli gene 55 expression compendium to extract 92 independently-modulated groups of genes (called imodulons). Of these 92 i-modulons, 61 represented transcriptional regulators and their compendium-wide activities. An additional 25 i-modulons were linked to biological functions or genetic perturbations, leaving only 6 uncharacterized. I-modulons have provided clear biological explanations for transcriptional changes in significantly perturbed cells 25,26 . Other research groups have successfully identified co-regulated gene sets using ICA on human transcriptomic datasets 19,22 , but characterization of some components was hindered due to the high fraction of unknown human genes compared to the model bacterium E. coli 27,28 .
A recent study showed that among dimensionality reduction methods, ICA produced the 65 most similar components across 14 independent cancer microarray datasets 29 . In this study we show that consistent regulatory components can be identified in expression datasets spanning disparate experimental conditions, and that these components are robust to dataset integration.
Through analysis of five independent E. coli transcriptomics datasets and two independent proteomics datasets, we identify a coherent structure without requiring batch normalization 70 procedures. In addition, integrated analysis of the different datasets demonstrated compelling evidence towards regulon discovery. These results present ICA as a promising tool to integrate and understand the flood of omics data challenging scientists today.

datasets
We compiled two RNA-seq and three microarray datasets, each using a different expression profiling technology or generated from a different research group (Table 1, Supplementary Data 1). Each dataset was independently processed, centered, and decomposed with ICA (See Methods). This process generated a set of independent components (ICs) for each 80 dataset (Figure 1a, Supplementary Data 2). Each IC contains a coefficient for every gene, although most gene coefficients were near zero for a given IC. To understand the biological role of each IC, we selected genes with sufficiently non-zero coefficients and refer to this set of genes as an "i-modulon" 30 . ICs, and their corresponding i-modulons, represent the underlying structure of the transcriptome under any experimental condition in the database. The condition-dependent 85 dynamics of gene expression are captured by the activities of the ICs (also referred to as imodulon activities, Supplementary Data 3). In this study, we focused on the condition-invariant structure of the transcriptome (i.e. ICs and their resultant i-modulons), to determine if transcriptome structure is conserved across platforms. We categorized the resulting i-modulons from each dataset into four classes, based on 95 their gene content: (1) Regulatory, (2) Functional, (3) Genomic, and (4) Uncharacterized ( Figure   1b,c, Supplementary Data 4). Prior work showed that i-modulons are highly consistent with, but not identical to, known regulons. On average, these i-modulons captured 80% of the known targets of their linked transcriptional regulators and have accurately predicted new regulon members 30 . Although i-modulons often contain genes known to be regulated by a single 100 transcriptional regulator (e.g. transcription factor, sigma factor, or riboswitch), it has been observed that i-modulons may represent combinations of multiple regulons 19,30 .

90
In this study, a Regulatory i-modulon was defined as an i-modulon that was statistically enriched with a single regulon (Fisher's exact test, FDR < 10 -5 ). Regulatory i-modulons were 105 named after the single transcriptional regulator whose targets provided the best overlap (See Methods).    Table 1). This example demonstrates that an i-modulon and its IC gene coefficients are unaffected by technology or dataset composition, given a diverse enough dataset. To extend our assessment of i-modulons reproducibility, we compared all i-modulons found in each dataset using a Reciprocal Best Hit (RBH) graph 29 (Figure 3a, Supplementary Table   2). In the RBH graph, each node represents an i-modulon, and nodes are connected when imodulons from two different datasets find each other as the best scoring i-modulon in the other dataset 43 . I-modulons were scored by the absolute correlation between the gene coefficients in 185 their respective ICs, as shown for the CysB i-modulon cluster. For clarity, edges with low scores were removed from the graph (See methods, Supplementary Figure 1).

Data integration reveals increased resolution and novel insights
I-modulons are a linear approximation of the structure underlying transcriptomics datasets.
Since elements of this structure were highly consistent across the five transcriptomics datasets, 230 we combined the datasets together to explore whether ICA would identify a similar structure in the combined compendium, which contained 802 expression profiles (Figure 4a).
We extracted 181 i-modulons from the combined compendium and characterized them as   Figure 5d). Therefore, we named it the "Central Dogma" i-modulon. Over half of the genes in this i-modulon are not known to be regulated 280 by any transcription factor, and the remaining genes are not enriched in any common regulator.
A recent study found that binding of ppGpp, the stringent response alarmone 45  Next, we investigated whether dataset integration could change the quality of the 290 Regulatory i-modulons. Since a Regulatory i-modulon consists of genes in a known regulon, the quality of a Regulatory i-modulon can be assessed using the F1-score, which is the harmonic mean of precision and recall of the i-modulon compared to the regulon (Supplementary Figure   5f). We inspected all Regulatory i-modulons that were found in both an individual dataset and the combined compendium and found that the average F1-score increased from 0.54 to 0.59 295 (Wilcoxon Rank Sum p-value = 10 -5 ) (Supplementary Figure 5g). This result showed that on average, Regulatory i-modulons from the combined compendium were more similar to the previously known regulons than the Regulatory i-modulons from individual datasets.
Finally, we checked whether the i-modulon activities were affected by dataset integration. 300 When we apply ICA to an expression profiling dataset, we obtain the S matrix, encoding the imodulon structure, and an A matrix, which contains activity levels for each i-modulon across every expression profile (Figure 1a and Figure 4a). We have focused our analysis thus far on the invariant properties of the S matrix, and briefly discuss the effects of data integration on the A matrix. 305 I-modulon activities reflect the overall change in expression of the i-modulon genes and serve as a proxy for transcription factor activities 26 . Therefore, it is important to ensure that the relative i-modulon activities are unchanged upon dataset integration. The absolute Pearson R correlation coefficient between the i-modulon activities of linked components showed that most i-310 modulon activities are unaffected by dataset integration (Supplementary Figure 5h).
Previously, principal component analysis (PCA) of integrated expression datasets revealed that the source of each dataset was the dominant discriminator 48, 49 . This result is recapitulated in our combined compendium (Figure 4g). However, since the i-modulon structure 315 of each dataset (encoded in S) is consistent across datasets, we see that technical heterogeneity is stored in the activity matrix (A) (Figure 4h). Therefore i-modulon activities cannot be compared across datasets but can still reliably be compared within the same dataset.

ICA elucidates similar structures in the E. coli proteome
Since the i-modulon structure was highly consistent across transcriptomics datasets, we 320 asked whether shadows of the transcriptional regulatory program could be observed in the proteome. Thus, we applied the extended ICA algorithm on two high-quality proteomics datasets from two independent research groups 50,51 ( Supplementary Datasets 1-3). The Proteomics-1 dataset had higher gene coverage whereas the Proteomics-2 dataset included more conditions (Figure 5a,b). 325 We characterized the i-modulons from the proteomics datasets as described previously (Figure 5c, Supplementary Dataset 4), and used the RBH method to compare the proteomic imodulons to i-modulons from the RNAseq-1 dataset (Figure 5d, Supplementary Table 4). We note that most i-modulons from the proteomics datasets were Uncharacterized, likely due to the 330 intrinsically lower coverage and small dataset sizes.
One pair of Uncharacterized i-modulons were independently detected in the RNAseq-1 dataset and the Proteomics-1 dataset (Supplementary Figure 6). The RNAseq-1 i-modulon contained many membrane-associated genes ( Figure 5e) and was active under osmotic stress 335 and in laboratory evolved strains (Figure 5f,g). In addition, many genes in this i-modulon had been identified in two prior studies as genes upregulated by the envelope stress Rcs-phosphorelay 52 or upregulated in response to beta-lactam antibiotics that inhibit cell wall biosynthesis 53 .
Therefore, we propose that all the genes in this i-modulon are co-regulated by a transcription factor related to the Rcs system that may currently be uncharacterized. This represents a clear 340 potential for discovery.
It is important to note that the RNAseq-1 i-modulons were more similar to the other transcriptomics i-modulons than proteomics i-modulons (Mann-Whitney U Test, p-value = 0.005).
Although some of this deviation may be due to differences between relative abundances between 345 transcripts and proteins, the small sample sizes of the proteomics datasets resulted in i-modulon coalescence as previously described with the MA-3 CysB i-modulon. In particular, one Uncharacterized i-modulon from the Proteomics-1 dataset was only active under two media conditions: LB rich media, and Glycerol M9 media supplemented with amino acids (Figure 5h).
Due to the lower coverage of genes and low number of conditions in the proteomics 355 datasets, it is not currently possible to conclusively state that the i-modulon structure is conserved between proteomics and transcriptomics datasets, but this section provides evidence that such structure exists.

Discussion:
Matrix decomposition is a powerful approach to extract knowledge from large transcriptomics datasets. In particular, we have shown that ICA identifies highly similar structures 375 between dissimilar datasets for the model bacteria E. coli. In addition, a combined compendium produced many identical i-modulons as the individual datasets and could distinguish further signals that could not be identified in the separate datasets. The i-modulons derived from the compendium can be applied to interpret new datasets, accelerating discovery and providing a standard framework that could be used to investigate any transcriptional regulator. 380 Throughout this study, we observed various properties of the i-modulon decompositions: (1) i-modulons co-occurring in multiple datasets tend to represent the effects of transcriptional regulators, (2) i-modulon detection from a data set is dependent on the experimental conditions Another important note is that the i-modulons extracted from each dataset are sensitive to the experimental conditions that are represented in the dataset; ICA cannot extract an i-modulon for a transcription factor whose activity never changes. Additionally, the CysB i-modulons demonstrated that i-modulons may represent the effects of multiple regulators, when the activities 400 of the regulators are highly correlated across the measured conditions. However, adding new data or increasing the dimensionality of the decomposition can decouple the regulators, splitting the i-modulon into its biological parts 30,44 .
In contrast to the condition-invariant i-modulon structure, i-modulon activities represent 405 the condition-dependent dynamics of expression profiles. In this study, we do not apply any normalization techniques to the data and clearly observe batch effects in the activity matrix of the combined compendium. Further work comparing identical experimental conditions from separate protocols is required to enable i-modulon activity comparisons across datasets. 410 We have shown that the E. coli transcriptome contains a conserved, underlying structure that is found across multiple independent datasets and extends into the proteome. Previously, we also showed that this structure also exists across strains within a species 30 . This powerful observation enables unprecedented re-analyses of thousands of previously published datasets and demonstrates how data science can unlock hidden potential in complex biological datasets. 415 Methods:

RNA-seq and Microarray Data Processing
The full, log-transformed transcripts-per-million (log-TPM) for the RNAseq-1 dataset 30 was obtained from https://github.com/SBRG/precise-db. Raw data comprising the RNAseq-2 420 compendium were obtained from NCBI SRA under the bioproject accession number PRJNA379428 31 . Raw data was processed using a similar process as described in Sastry et al. 30 . Raw sequencing reads were mapped to the reference genome (NC_000913.

Proteomics data processing 455
Processed data for the Proteomics-1 dataset was acquired from Schmidt et al. 50 , using the values reported for protein copies per cell. Only genes with measurements across all samples were retained, resulting in 2058 genes. Processed data for the Proteomics-2 dataset was acquired from Heckmann et al. 51 and augmented with additional samples for growth on different carbon sources that were acquired as described in Heckmann et al. 51 (See Supplementary Dataset 1 for details). 460 Protein abundance estimation is described in Heckmann et al. 51 ; in short, the top3 metric 63,64 was calibrated using the UPS2 standard to obtain protein amount loaded based on the average of three technical replicates per sample. For each strain, two biological replicates were profiled, and abundances were averaged before applying ICA.

Independent Component Analysis
ICA decomposes a matrix (X) into two matrices: S contains the independent signals, or structure, of the dataset, and A contains the condition-dependent activities of the signals: ICA was applied to each individual dataset and the combined compendium, as described in Sastry et al. 30 . Briefly, we executed FastICA 100 times with random seeds and a convergence tolerance of 10 -6 for RNA-seq data, and a convergence tolerance of 10 -7 for proteomics data. We constrained the number of independent components (ICs) in each iteration to the number of 475 components that reconstruct 99% of the variance as calculated by principal component analysis.
The resulting ICs were clustered using DBSCAN to identify robust ICs, with parameters with epsilon of 0.1, and minimum cluster seed size of 50. This process was repeated 10 times, and only ICs that consistently occurred in all runs were kept.

480
As described in Sastry et al. 30 , i-modulons were extracted from ICs by iteratively removing genes with the largest absolute value and computing the D'Agostino K 2 test statistic of the resulting distribution. Once the test statistic fell below a cutoff, we designated the removed genes as the "i-modulon".

485
To identify this cutoff for each individual dataset, we performed a sensitivity analysis on the concordance between significant genes in each IC and all known regulons. First, we isolated the 20 genes from each IC (10 genes for proteomics i-modulons) with the highest absolute gene coefficients. We then compared each gene set against all known regulons using the one-sided Fisher's Exact Test (FDR < 10 -5 ). For each component with at least one significant enrichment, 490 we selected the regulator with the lowest p-value.
Next, we varied the D'Agostino K 2 test statistic from 200 through 1000 in increments of 50. Using the protocol defined above, i-modulons were extracted from ICs at each test statistic value, and the F1-score (harmonic average between precision and recall) was computed between the 495 significant genes and its linked regulator. The test statistic with the maximum F1-score was used as the test statistic cutoff for the respective dataset.

Characterizing I-modulons
We compared the set of significant genes in each i-modulon to each regulon (defined as the set 500 of genes regulated by any given regulator) using the one-sided Fisher's Exact Test (FDR < 10 -5 ).
We then compared the significant genes in each i-modulon to the genes in each Gene Ontology (GO) term using the one-sided Fisher's Exact Test (FDR < .01). We added prophage information from Ecocyc to our GO database to capture i-modulons representing prophages. In general, the final annotation was selected by the enrichment with the lowest q-value. Some i-modulon annotations were manually curated, as denoted in Supplementary Dataset 5. Genomic imodulons were also manually curated by (1) comparing i-modulon genes to known genetic alterations (e.g. knock-outs or overexpression), and (2) validating that the i-modulon activities were affected in the appropriate direction for the corresponding strain.

510
To account for the lower coverage of the proteomics datasets, we trimmed the TRN and GO databases to only include relevant genes. We compared proteomics i-modulons to regulons with FDR < 10 -3 and compared proteomics i-modulons to GO terms with FDR < .01.

Comparing I-modulon Structures: 515
To compare the complete structure of the transcriptomic datasets, we constructed the reciprocal best hit (RBH) graphs using the full IC gene coefficients, rather than just the i-modulon genes.
We generated the RBH graph as described in Cantini et al. 2019 29 , using the following distance metric to compare ICs: 520 dx,y = 1 -||ρx,y|| where ρx,y is the Pearson correlation between components x and y.
We pruned all RBHs to remove links between ICs with similarities below 0.3. The full graph is 525 shown in Supplementary Figure 1. The RBH graph was plotted using GraphViz 65 .

Linear Regression of I-modulons:
Regression of the MA-3 CysB i-modulon and the Proteomics-1 i-modulon were performed using the LinearRegression function from Scikit-learn 66 . The ten ICs from the RNAseq-1 dataset with 530 the highest absolute IC gene correlations with each of the CysB i-modulon and Proteomics-1 imodulon were selected as regression variables for their respective regressions (See Supplementary Tables 1 and 5).

Inference of I-modulon activities: 535
Raw reads for the ppGpp-RNAP dataset were downloaded from NCBI SRA (PRJNA504613) and processed as described above into log-TPM expression values. Two experimental conditions were selected for comparison, both using the wild-type strain with active relA, at 0 and 5 minutes after IPTG induction. Log-TPM expression values were averaged across triplicates.

540
To infer i-modulon activities and calculate the amount of variance that i-modulons explain between the two datasets, we first centered the two averaged expression profiles, then computed the genewise difference. The change in i-modulon activity was calculated by multiplying the expression difference (ΔX) by the pseudo-inverse of the S matrix from the full compendium:

Explained Variance: 550
Explained variance between two conditions was calculated as follows: Where k is the i-modulon of interest.

555
Code availability: Code central to the conclusions is described in the methods and available at https://github.com/SBRG/precise-db. Additional code is available upon request.