A resource for analyzing C. elegans’ gene expression data using transcriptional gene modules and module-weighted annotations

Identification of gene co-expression patterns (gene modules) is widely used for grouping functionally-related genes during transcriptomic data analysis. An organism-wide atlas of high quality fundamental gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method of independent component analysis we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism C. elegans. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially-expressed genes fail to do so. Using this resource, users can easily identify active modules in their gene expression data and access detailed descriptions of each module. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g. GO term). Analysis of “module-weighted annotations” improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. Interactive access to the resource is provided at http://genemodules.org/.


Introduction
Nearly half of the predicted protein-coding genes in Caenorhabditis elegans lack a functional annotation based on direct experimental evidence 1 . As a result, querying gene annotation databases, such as the Gene Ontology (GO) or curated-pathways can fail to detect biologically meaningful signals in gene expression data 2-4 . An alternative approach to understanding gene function is to use information about genes' transcriptional activity. Gene-expression data can be used to define groups of genes that show similar patterns of expression, or co-variation, across multiple conditions. These groups are called transcriptional gene modules, with each module potentially representing a discrete biological phenomenon. Gene modules are routinely constructed when clustering algorithms are applied to gene expression data and have been used successfully to identify gene regulatory mechanisms in a variety of contexts, from the yeast cell cycle 5 and sporulation 6 to human cancer cells 7 to cognitive decline in patients with Alzheimer disease 8 . Furthermore, large compendia of data sampling diverse perturbations have been used to define fundamental gene-expression programs of entire organisms [9][10][11] .
In C. elegans, the most recent effort to generate high-quality fundamental transcriptional modules is now almost two decades old 12 . Co-expressed genes were grouped together into 43 groups (or "mountains") based on their correlation across 553 microarrays. However, the compendium did not contain all genes (e.g. over 30% of the microarrays only contained ~11,000 of C. elegans ' 20,470 protein-coding genes) and each gene was assigned exclusively to one group, although it is well-established that genes can participate in multiple processes 13,14 . The number of perturbations for which gene expression data are available has also increased substantially since 2001.
Here, we define 209 transcriptional gene modules in C. elegans using a heterogeneous compendium of 1386 microarrays and a method we call DEXICA, for Deep EXtraction Independent Component Analysis. DEXICA builds on prior implementations 10,15,16 of independent component analysis (ICA) for gene module extraction by maximizing the biological information content of the modules. It does so by varying data pre-processing methods and the number of extracted ICA components (modules) until the number of biological annotations is maximized. DEXICA also uses an artificial neural network to partition each independent component, i.e. gene module, to determine which genes should be included and which should be excluded.
We show that the 209 DEXICA C. elegans modules capture gene expression patterns that correspond to biological processes; for example, responses to osmotic stress, xenobiotics and pathogenic bacteria, and to several individual tissues.
Furthermore, data analysis in the module space correctly reveals biological processes that are missed by analyses of differentially-expressed genes. We provide a userfriendly web interface in which users can test which of the 209 gene modules are active in their datasets and find detailed information about each module that helps determine which biological process(es) they represent.
Finally, we explore whether gene modules can be used to improve existing gene annotations. We reason that an annotation shared among co-expressed genes is more likely to be relevant to their function than one that is not, and that annotations of coexpressed genes can provisionally be "transferred" onto their poorly-annotated companions. We calculate what we call "module-weighted gene annotations" by weighting the association between a gene and an annotation by the degree to which the annotation appears predictive of the gene's module membership. We show that matrixbased analysis of module-weighted annotations is more sensitive and specific than common annotation enrichment tests. We provide a framework for using moduleweighted annotations to detect significant GO terms and promoter oligonucleotides directly from expression data, and to identify novel GO terms conferred onto genes based on their module membership.

Development of DEXICA and extraction of gene modules
A large body of gene expression data is publicly available 17,18 and has enabled computational prediction of gene modules 10,12,[19][20][21] . We refer to our method for going from a raw compendium of gene expression data to an optimized set of gene modules and a list of genes that belong to each module as DEXICA, for Deep EXtraction Independent Component Analysis (described below).
While several methods exist for defining gene modules, independent component analysis (ICA) generally outperforms clustering-based approaches and principle component analysis in extracting biologically relevant signals from large datasets 15,[22][23][24] .
Advantages of ICA include its ability to deal well with high dimensional data and to generate modules that can share genes. Furthermore, ICA does not assume that latent signals in the data follow a Gaussian distribution, an important property for gene module prediction, as gene regulation signals appear to be primarily super-Gaussian 25 . For these reasons, we chose to use ICA for module construction.
Briefly, ICA is a blind source separation method that attempts to "unmix" a signal comprising additive subcomponents by separating it into statistically-independent source signals 26 . In the common notation, an n x m data matrix, X, is decomposed into two new matrices -a n x k source matrix, S, and an k x m mixing matrix, A, where k is the number of independent components: In the context of gene-expression analysis, X is a matrix of m measurements (e.g. microarrays) of n genes, and k independent components are interpreted as gene modules. A indicates the weight of each module in each microarray and S indicates the relative level of inclusion of each gene in each module 27 (Figure 1a).
Using simulated data, we found that module-prediction accuracy was highest when the number of extracted components matched the true number of modules (Supplementary Figure 1). Therefore, we sought to optimize module prediction by evaluating results based upon their biological information content, such as enrichment of Gene Ontology (GO) terms 2 , REACTOME pathways 4 , and tissue-specific expression 28 . We applied ICA to a diverse compendium of 1386 C. elegans Affymetrix arrays obtained from the Gene Expression Omnibus (GEO) database 18 . We then compared results obtained from a wide variety of preprocessing methodologies and total number of components extracted. Omitting between-experiment quantile normalization from the preprocessing procedure produced modules that were more annotation rich than those produced by a published implementation of ICA-based module extraction 10 (Figure 2a-c). Furthermore, annotation content of the modules was maximized when the number of gene modules (i.e. independent components) ranged from 191 to 226.
To enable functional evaluation of the modules, it is useful to summarize them as discrete gene sets. To this end, we partitioned each column of the S matrix into three sets of genes: one set consisting of genes excluded from the module, and two other sets consisting of genes regulated in opposite directions. We refer to these latter two sets as "hemi-modules", one set consisting of genes with highly positive weights and another consisting of genes with highly negative weights (signs assigned arbitrarily) in the independent component. While others have used a fixed-threshold approach to component partitioning 10,29,30 ; for example, defining genes with weights exceeding +/-3 standard deviations from the component mean to be "in" each hemi-module, we found that individual modules showed maximum annotation enrichment at different thresholds, suggesting that a 'one-size-fits-all' approach to partitioning is sub-optimal. An alternative approach to partitioning that we attempted (described in Frigyesi et al. 31 ) failed to converge in many cases (data not shown). Therefore, to increase partitioning accuracy, we trained a function to predict partitioning thresholds from the shape of  Figure 2d. While others have reconciled such differences through a clustering approach applied to the output of numerous runs of the algorithm (so called "iterated ICA") 10,31 , when applied to the C. elegans Affymetrix compendium, many of the final components generated by this method were highly correlated to one another, indicating non-independence and potential redundancy among the components (data not shown). We therefore sought to choose a single, high quality, fastICA run output to use as predicted gene modules. Because we considered word enrichment the most unbiased measure of module quality (as it relies only on DNA sequence data), we chose as our final module set (Supplementary Table 1 Co-expressed genes often share common structural elements. For example, genes within operons are switched on together during recovery from growth-arrested states in C. elegans 33 and 3'-UTR length is associated with proliferation in cancer cells 34 . To further explore the information content of DEXICA-extracted modules, we tested each hemi-module for over-and under-enrichment of genes with long 3'UTRs, for genes appearing in operons and for genes with multiple splice forms. Of the 418 hemimodules, 65 contained a significant bias toward long 3'-UTR genes and 58 contained a bias toward short 3'-UTR genes (q < 0.1, threshold chosen based on randomized control trials, see below; Figure 2e). Twenty-one hemi-modules were significantly enriched and 205 hemi-modules were significantly depleted for operon genes, and 81 hemi-modules were enriched and 80 hemi-modules were depleted for genes with multiple splice variants ( Figure 2e). Control tests performed on the same module set but with randomly scrambled gene IDs produced no significant modules below q = 0.1 for any of the gene properties we tested (data not shown). Therefore, DEXICA successfully groups genes with common structural features into modules, consistent with the known relationship between gene structure and expression.
Together, these results show that the 209 C. elegans gene modules extracted from a large microarray compendium are enriched for GO terms, tissue and pathway annotations as well as for potential regulatory DNA sequences and gene structural properties. DEXICA is available as an R package (https://github.com/MPCary/DEXICA). It provides tools to optimize ICA module extraction and partitioning based on annotation enrichment and can be applied to any gene expression compendium.

C. elegans DEXICA modules are biologically informative
We observed enrichment of functional annotations and oligonucleotide sequences in DEXICA-extracted modules with even the least optimal parameter settings (see Figure   2). To further test the biological significance of the final 209 modules and to begin annotating them, we constructed an alternate microarray compendium comprising not 1386 individual arrays, as in the original compendium, but rather the gene fold changes arising from contrasting experimental and control samples in the same experiment.
Projecting the resulting 716-column matrix (one for each contrast) into the space defined by our gene modules allowed us to see which experimental perturbations activate or inhibit which modules.
We compared the most enriched GO terms in each module to the most strongly activating or inhibiting experimental perturbations and observed that in many cases these were in obvious agreement (Supplementary Table 2). For example, the strongest activator of m10 (module 10) in the compendium is a mutation in osm-7, a gene needed to respond to hypertonic stress through accumulation of glycerol. Consistent with this, the top GO terms enriched within genes that comprise m10 describe glycerol metabolism and accumulation. Module 153 is strongly activated by mutations in lin-35, a gene that is part of the DRM/DREAM complex, and accordingly, this module is also enriched for "DRM complex" GO terms. Similarly, m200 is activated in response to the pathogen P. aeruginosa and the top GO terms enriched within m200 genes include "defense response to Gram-negative bacteria" and "innate immune response".
DEXICA modules can also capture gene expression patterns that distinguish different tissues. For example, m23 is strongly activated by neuronal RNA and the top GO term enriched in this module is "neuropeptide signaling". Similarly, m144 distinguishes muscle cells from other cell types in the worm and is enriched for the "sarcomere organization" and "striated muscle dense body" GO terms. Together, these results indicate that DEXICA modules represent biologically meaningful sets of genes.
It is important to note that a lack of an obvious agreement between the nature of a perturbation that strongly activates a module and the top GO terms enriched in that module does not necessarily indicate that a module is biologically meaningless. On the contrary, this apparent disagreement can be a powerful tool for understanding complex transcriptional responses to perturbations. For example, starvation of L1 larvae induces a developmental arrest and has widespread effects on metabolism 35 . Module 51 is strongly activated by starvation and is enriched for "snoRNA/nucleolus" GO terms.
Thus, this module represents genes that play a role in ribosome biogenesis and protein synthesis, reduction of which is an important response to starvation. Obvious agreement between a perturbation and GO terms would also be absent if a process had not previously been associated with a particular condition, making modules useful for hypothesis generation. For example, a wild C. elegans isolate, JU1580, shows strong activity of m55 relative to the laboratory C. elegans strain, N2, (Supplementary Table 2) and genes within m55 are enriched for the "innate immune response" GO term. This suggests that the immune systems of wild and lab C. elegans strains have different levels of basal activity, a connection that, to our knowledge, has not been proposed previously.

Annotation of DEXICA modules
Detailed information about modules that are active in a gene-expression experiment would be of primary interest to an investigator. As a comprehensive resource, we A limitation of using GO terms and other pre-existing annotations to interpret gene modules is that some cellular activities are poorly annotated (see UPR mt example below). For this reason, an additional strategy to understanding what a given module represents is to examine its activity under a variety of conditions. To facilitate this, in the Module Annotation Pages we have also provided a ranked list of perturbations (from the 716 we generated) that activate each module significantly (also see Supplementary Table 2). While it is unlikely that a typical gene module can be described completely using a single semantic annotation due to the complexity of gene regulatory networks, we suggest processes that modules likely represent based on our comparison of module activity under several conditions and our examination of enriched annotations (Supplementary Table 2). In a typical workflow, once the active modules in a gene expression experiment are identified by a researcher, we recommend that the researcher consult Module Annotation Pages and Supplementary Table 2 to infer the biological process(es) described by the modules of interest.

Mitochondrial unfolded protein response
To test the utility of DEXICA modules for interpreting C. elegans gene expression data, we asked whether we could identify known biological processes activated in a mutant with a complex phenotype. Mutations in components of the electron transport chain reduce respiration, slow development and reproduction and increase lifespan in C. elegans 36 . The mitochondrial unfolded protein response (UPR mt ) is induced in response to a stoichiometric imbalance between nuclear and mitochondrial proteins within mitochondria, and this process is known to be active in long-lived respiration mutants 37,38 . However, analysis of significantly differentially expressed genes in the isp-1 respiration mutant, either using GO terms or KEGG pathways, failed to identify UPR mt or, for that matter, any process related to mitochondria or a response to stress ( Figure   3a).
Calculation of module activity (Tool 1) revealed that m47, m66 and m169 display the highest activity in the isp-1 mutant (Figure 3b). These modules are strongly active in other respiratory chain mutants as well (see Module Annotation Pages), but the most strongly activating non-respiration perturbation of m47 is a mutation in spg-7, a mitochondrial protein quality-control protease, disruption of which is known to induce UPR mt 39 ( Figure 3b). This result suggested that m47 may encompass UPR mt genes. If so, then m47 should be active in animals with constitutive activity of the transcriptional UPR mt regulator, ATFS-1. To test this prediction, we calculated module activity in atfs-1 gain-of-function mutants (this perturbation was not part of the original compendium, GEO Accession number GSE73669). Indeed, induction of the UPR mt transcriptional response in otherwise healthy animals using this mutation strongly and specifically induced activity of m47 ( Figure 3c). Furthermore, genes that belong to m47 showed concordant expression in the isp-1 respiration mutant, in response to UPR mt induction by disruption of spg-7, and in response to constitutive activity of ATFS-1 in normal animals ( Figure 3c, d). As expected, mutation of atfs-1 prevented these changes in animals with an induced UPR mt (spg-7 RNAi) (column 4 Figure 3d). Moreover, two chaperone genes known to be induced during UPR mt (hsp-6 and dnj-10) are part of m47. Taken together, these data strongly suggest that m47 represents genes induced by ATFS-1 during UPR mt . Therefore, analysis of module activity, but not GO term or KEGG pathway enrichment analyses, was able to correctly identify a key biological process using gene expression data from isp-1 mutants.
Why did GO term enrichment analysis fail to recover the "mitochondrial unfolded protein response" term (GO:0034514)? To our surprise, there are only eight C. elegans genes annotated with this GO term (haf-1, ubl-5, gcn-2, atfs-1, clpp-1, dve-1, hsp-6 and hsp-60). Among these, only hsp-6 and hsp-60 are induced during UPR mt , whereas the others are genes needed for activation of UPR mt . In the KEGG pathway database, UPR mt is not included as a discrete entry, but is part of the "Longevity regulating  Figure 4). In contrast, when we tested enrichment of the ATFS-1 gene set within modules, we found a highly significant enrichment within m47 (Figure 3e).
These results show that given a set of functionally related genes, testing enrichment of that set within modules can be more informative than testing enrichment within a ranked list of gene changes, likely because modules comprise groups of genes that are functionally related.

Hypoxia inducible factor 1
Reduction-of-function isp-1 mutations extend lifespan in a manner dependent on the hypoxia inducible factor 1 (HIF-1) 42 , but, unexpectedly, we found that the overlap between the significant genes obtained by comparing isp-1 or hif-1 mutants to wild type was not statistically significant (Χ 2 test p-value = 0.17; Supplementary Figure 5a). The other highly-active modules in isp-1 mutants besides m47 are m66 and m169. We   we observed the similarity between the hif-1 and isp-1 projections, we might have hypothesized a role for hif-1 in isp-1 mutants before such a role was discovered from genetic screening 42 .

Using modules to improve gene annotations
Because gene annotations can be incomplete and biased, knowledge about which genes tend to be co-expressed in an organism can help inform which annotations are more likely to be biologically relevant and help infer annotations of orphan genes based on annotations of other module members (see Supplementary information for additional discussion). To this end, we devised a calculation of scores, which we call "moduleweighted annotations", based both on how much a given annotation is enriched in each module (matrix E, Figure 1b-c) and the degree to which a given gene belongs to that same module (matrix H, Figure 1c).

Module-weighted GO terms
To test the utility of module-weighted annotations, we first examined whether they We ranked each GO term by the weight of its most strongly associated gene (Table 1). GO terms with the most highly weighted genes were "ribosome" (CC) and "structural constituent of ribosome" (MF). Consistent with this finding, genes involved in ribosome biogenesis are known to be transcriptionally co-regulated 43 . On the other hand, GO terms associated with signal transduction and kinase activity showed the lowest gene weights. These results are statistically significant (see Online methods) and show that some types of gene groupings (e.g., genes encoding kinases) often used in the analysis of gene expression data may not actually represent functions that are coordinately regulated at the gene transcript level. Therefore, we suggest that researchers use caution when interpreting a result that certain annotations, like "kinases", appear to be enriched in a transcriptomics experiment.
A sensitive test will tend to give similar results despite small amounts of random noise being added to the input data. To test the sensitivity of a module-weighted annotation-based enrichment test, we obtained gene fold changes for the 5 most recent C. elegans Affymetrix experiments deposited to the GEO database. These experiments were not included in the data used to construct the modules. We then added varying amounts of Gaussian noise to the fold changes and for each level of noise, we calculated enrichment z-scores for each GO term using three different methods: the Kolmogorov-Smirnov (KS) test, the t-test (n.b., one-sample and two-sample t-tests produced nearly identical results, data not shown), and scalar projection of the gene fold changes onto a module-weighted GO term matrix. We then compared these results to those obtained by each method when no noise was added (Figure 4a.) Dissimilarity to the initial results (zero added noise) increased rapidly with added noise for both the KS test and t-test, while the projection-based results were similar (ρ > 0.75) at even the highest noise levels tested (5 standard deviations), suggesting that the projection-based test is more sensitive than both the KS and t-tests.
A specific test will tend to show dissimilar results when given dissimilar inputs.
To test the specificity of module-weighted annotation analysis, we selected the 100 Module-weighted GO terms for each gene are provided in Supplementary Table 4 and their significance in a query gene expression dataset can be tested using our C.
elegans gene-modules analysis tools (see Data availability Tool 6).

Module-weighted promoter words
While any Boolean gene annotation may be converted into a module-weighted annotation, module-based weights seem particularly well suited for describing regulatory sequences, such as putative transcription factor and microRNA binding sites.
To this end, we generated weighted annotations for each of the 5230 words in the promoter word dictionary we constructed using the Mobydick algorithm (described above and in Methods).
To validate predicted regulatory word weights, we searched the literature, the JASPAR database of transcription factor binding profiles 46 , and the GEO database for For hif-1 and nhr-23, the most significantly enriched words in the positively and negatively changing genes, respectively, matched the canonical binding sites ( Table 2).
A word matching the hlh-30 binding site scored 6 th overall among the up-regulated genes, and for daf-12, four of the top 20 words for the up-regulated genes contained GAACT or AACTT, which partially matched the reverse compliment of a reported daf-12 binding half-site, AGTTCA 47 . In the daf-16 data set, several words matching the socalled "daf-16 associated element" (DAE) 48  Finally, we tested whether analysis of module-weighted promoter words could reveal the activity of the HIF-1 transcription factor in isp-1 respiration mutants, missed by analysis of significant genes but implied by analysis of gene module activity (described above). We calculated promoter word z-scores for the isp-1 microarray data set and compared the results to those for hif-1. As predicted, we observed a very strong anti-correlation between the promoter word z-scores for isp-1 mutants and those These results further support the utility of module-weighted promoter oligonucleotides for identifying biologically relevant regulatory sequences directly from gene expression data.

Module-weighted annotations as a hypothesis generation tool
Our process of weighting gene annotations using modules provisionally transfers annotations that are enriched in a module to all gene members of that module (proportional to each gene's inclusion). This can be useful for studying individual genes (see Data availability Tool 5 and Supplementary Table 4) -if a query gene has a strongly-weighted association to a GO term with which it is not traditionally associated, it means that this gene is co-expressed with other genes that are traditionally associated with the GO term. For example, most of the module-weighed GO terms associated with the small ribosomal subunit S16 (rps-16) have something to do with the ribosome or translation (see Data availability Tool 5), implying that rps-16 is tightly co-expressed with other genes involved in ribosome biogenesis and not much else. On the other hand, while sod-3, a superoxide dismutase, does have significant weights for GO terms with which it is traditionally associated (e.g. oxidoreductase activity, superoxide metabolic process and response to superoxide), catalase activity ranks more highly for sod-3. This means that sod-3 is co-expressed with genes annotated as catalases.
While it is unlikely that sod-3 has a novel catalase activity, it is more likely that expression of sod-3, a superoxide dismutase, is coordinated with expression of catalases, since hydrogen peroxide produced by sod-3 is further degraded by catalases to avoid damage to the cell. Some of the other top-ranking module-weighted GO terms for sod-3 describe dioxygenase activity. Similarly, this may indicate that dioxygenases generate superoxide radicals, and therefore increased expression of these enzymes is typically correlated with increased expression of a superoxide dismutase. Therefore, analysis of module-weighted annotations of individual genes can help form hypotheses about novel gene functions and transcriptional co-regulation of distinct biological processes.

Discussion
We have captured much of the transcriptional wiring of C. elegans by extracting gene co-expression modules from a large compendium of data. These 209 modules represent transcriptional signatures of diverse biological processes that can occur in C.
elegans and, along with the tools we provide, can be used as a resource for analyzing gene expression data. Because genes are grouped into modules based purely on how they behave in experimental assays, and because modules encompass both previouslyannotated and unannotated genes alike, modules can reveal signals within transcriptomic data that otherwise would be missed due to incomplete knowledge of gene function, subtle gene expression changes or noisy data.
Experimentally, gene-expression modules can be used to deconvolve complex phenomena into subsets of co-regulated genes, genes that likely act together to mediate a specific process. The modules can be annotated extensively, as we have done, and these annotations can be applied provisionally to all genes in the module.
Regulatory factors associated with specific annotations (like 5' or 3' oligonucleotide words) can be implicated as well, and functional links can be revealed between dissimilar conditions that activate the same modules.
ICA has been applied to the prediction of gene modules before, but we could find no examples in the literature of optimizing the process using biological metrics in the manner that we describe. Combined with the improved ability to partition independent components provided by our artificial neural network approach, we expect that DEXICA will be useful for constructing gene modules for other organisms. The DEXICA software and our C. elegans modules and data are freely available as R packages online (see Data availability).
While we have taken steps to maximize module prediction accuracy for the microarray compendium we assembled, many additional gene modules may exist in C.
elegans that were not perturbed sufficiently in the samples comprising the compendium to be detected. These gene modules would remain hidden. As new areas of research are explored and new experiments are published, however, new fundamental gene modules may be discovered. For example, most of the gene expression data in our compendium was collected using whole animals. Data collected from isolated cells or tissues could help produce modules that are active in a relatively small number of cells.
The DEXICA package can be used to create new and improved gene modules using compendia with expanded information content.
The impetus for constructing gene modules in C. elegans was to create gene groupings that do not rely on the existing annotations, because annotations of many genes are missing or incomplete. We then wondered whether numeric scores based on gene co-expression could actually improve the existing gene annotations, leading us to develop the concept of module-weighted annotations. By weighting an association between each gene and each annotation by the degree to which that annotation appears to predict gene modularity, annotations that are shared by co-expressed genes are "boosted" and those that are not are diminished. It is important to note, however, that relevance scores between genes and annotations could be calculated using other metrics of gene behavior as well (e.g. a gene might be weakly associated with a term in the context of gene expression but strongly associated with it in the context of physical protein interactions).
We have found that analysis of module-weighted GO terms is less sensitive to noise than are typical statistical over-representation tests. Furthermore, because module-weighted annotations incorporate information about gene expression, they effectively model the gene-gene correlation structure of the system. This is useful because typical over-representation tests do not perform well with gene sets that have a high level of gene-gene correlation (i.e. annotations assigned to genes that are strongly co-expressed are more likely to be significant [49][50][51] . GSEA, for example, deals with gene-gene correlation issue using permutation. We show that module-weighted GO terms produce significantly fewer false positives than do over-representation tests (see Figure 4b). Therefore, we think that module-weighted Finally, because annotations shared among a significant fraction of the genes in a module are "transferred" to all genes in the module, module-weighted annotations can be used provisionally to infer novel functions for genes, which is especially useful for studying poorly annotated genes.

Compendium construction
To build the compendium of 1386 C. elegans Affymetrix arrays, we first downloaded all CEL files with the appropriate platform ID (GPL200) from the GEO database available on March 1, 2014, excluding those for which the organism was not C. elegans and the sample type was not RNA. We excluded arrays from experiments for which fewer than 8 hybridizations were performed in order to mitigate the effect that under-sampled conditions might have on predicted modules. We then performed a quality control step using the quality assessment functions provided in the simpleAffy (v2.40.0) R package (http://bioinformatics.picr.man.ac.uk/simpleaffy/), discarding arrays that did not meet the quality thresholds recommended in the simpleAffy documentation.
We generated expression values for probesets separately for each experiment (determined by GEO series IDs) using the RMA preprocessing procedure provided in the affy (v1.40.0) R package 53 , then used the bias (v0.0.5) R package 54 to remove intensity-dependent biases in expression levels. We then concatenated the expression matrices for each experiment into a single matrix. Next, we either performed between-experiment quantile normalization on the entire matrix using the limma (v3.18.13) R package 55 , or omitted this step, depending on preprocessing method to be tested.
Finally, we scaled and centered the arrays and centered the genes such that the mean of each row and column were zero and the standard deviation of each array was 1.

Conducting ICA
To conduct ICA of the gene expression matrix, we used the fastICA (v1.2-0) R package (http://CRAN.R-project.org/package=fastICA) with default parameters except for the "method" parameter, which we set to "C" to increase computational speed, and the "row.norm" parameter, which we set to "TRUE" in order to balance the total compendium variance between genes with subtle changes in expression values and those with large changes in expression values.

Partitioning of independent components
To convert independent components to discrete sets of genes, we employed two methods. In the first, for each component, we assigned all genes with a weight <= -3 to the negative hemi-module, and all genes with a weight >= 3 to the positive hemimodule. In the second, we created an artificial neural network using the neuralnet genes whose weights exceeded these thresholds to the corresponding hemi-modules.

Obtaining gene annotations and microarray data
To obtain GO term and REACTOME pathway annotations for genes we used the biomaRt (v2.18.0) R package 56 , using the ensembl mart for data retrieval. To obtain tissue annotations for C. elegans genes, we downloaded all available data from the GFP Worm database (http://gfpweb.aecom.yu.edu/) 28

Optimizing gene module prediction
To optimize gene module prediction, we performed ICA with a variety of different data preprocessing options (e.g., the choice of preprocessing algorithm (RMA, GCRMA, PLIER, MAS 5.0), background, perfect match, bias correction, and normalization methods), and with a varied number of extracted components from 5 to 500 by increments of 5. For each parameter combination, we repeated ICA 5 times.
We tested the biological validity of the independent components generated by each ICA run by determining the number of annotations that were enriched in at least one hemimodule. We chose not to optimize based on the number of modules with at least one significant annotation, as tests using simulated data showed that this approach could lead to signals being split into multiple, less accurate representations (data not shown).
We first calculated a p-value for the enrichment of genes associated with each annotation term in each hemi-module using the hypergeometric test. We then applied

Quantification of module activity in gene expression data
To project a data vector, x, such as a set of gene expression fold changes, onto a set of gene modules, we used the scalar projection method, in which a mixing vector, a, is calculated from the dot product of the data vector and the unit vectors comprising the module definitions, ܵ መ , as shown in equation 2: The resulting mixing vector, a, provides an indication of the weight of each module definition vector in the projected data, x. Projection of a data matrix, X, which generates a mixing matrix, A, was carried out using the same procedure.
To calculate signed variance explained (SVE), we calculated the relative variance explained (VE) for each module from a as follows, where n is the total number of modules: We then multiplied these values, which are strictly positive, by -1 in each case where a i < 0 to obtain SVE.

Generation of module-weighted annotations
Our calculation of module-weighted annotations takes advantage of the fact that, in modules generated by ICA, prior to partitioning, each gene has a weight in each module. Given a score or weight for each annotation in each module, this allows genes to be associated with annotations via a matrix product calculation. In the E matrix (see above) highly positive values correspond to strongly enriched annotations in a hemimodule, highly negative values correspond to strongly depleted annotations. We transform the gene module matrix, S, into an unpartitioned hemi-module matrix, H, by concatenating it with a negative copy of itself column-wise: The product of this matrix with the E matrix produces matrix R, which relates genes to annotations: As a final step, we normalize the values in this matrix row-wise, i.e, separately for each annotation, by subtracting the mean and dividing by the standard deviation.

Analysis of GO terms based on gene weights
To test whether the ranking of GO terms based on gene weights (Table 1) is statistically significant, we constructed a list comprising all words used in all GO terms, excluding words shorter than three letters and uninformative words (e.g. "the" and "for".) We then tested each word for bias toward appearing near the top or bottom of the ranked GO term list. In agreement with our initial observations, the most significantly top-biased words pertained to macromolecular complexes, such as "nucleosome", "cilium", and "ribosomal", and the most significant bottom-biased words pertained to cell signaling, such as "signal", "kinase", and "receptor". Many of the GO terms containing cell signaling words were generic in nature, e.g., protein kinase regulator activity, thus, our results may partially be explained by a lack of co-regulation among constituents of different signaling pathways. However, some specific cell signaling terms, e.g., Notch signaling pathway also appeared near the bottom of the ranked GO term list, suggesting that the genes annotated with such terms are either not strongly co-regulated at the gene expression level or that the biological conditions represented by the compendium did not perturb their expression enough to form modules with our method.

Analysis of expression data with module-weighted annotations
To test whether a set of gene fold-changes were significantly enriched for specific annotations, given a module-weighted annotation matrix, R, we calculated the dot product of the data vector, x, comprising the set of gene fold-changes, and the R matrix: The resulting vector, a, provides an indication of the degree to which genes with strong weights for each annotation also have strong fold-changes. To generate p-values from these, we permuted the fold-change vector, x, 1000 times to create a background distribution for each annotation, which we then used to determine z-scores.

Data availability
The authors declare that all data supporting the findings of this study are available within the paper and its supplementary tables. Dynamic access to the data is also provided through a graphical user interface at http://genemodules.org/. Specifically, 6 separate functionalities are provided: