Concise functional enrichment of ranked gene lists

Genome-wide expression data has become ubiquitous within the last two decades. Given such data, functional enrichment methods identify functional categories (e.g., biological processes) that preferentially annotate differentially expressed genes. However, many existing methods operate in a binary manner, disregarding valuable information contained in the gene ranking. The few methods that consider the ranking often return redundant or non-specific functional categories. To address these limitations, we developed a novel method called Concise Ranked Functional Enrichment (CRFE), which effectively leverages the ranking information in gene expression data to compute a non-redundant set of specific functional categories that are notably enriched for highly ranked genes. A particularly useful feature of CRFE is a tunable parameter that defines how much focus should be given to the most highly ranked genes. Using four treatment-control RNA-seq datasets, we compared the performance of CRFE with the two most widely used types of functional enrichment methods, Gene Set Enrichment Analysis and over-representation analysis. We evaluated the methods based on their ability to utilize ranking information, generate non-redundant results, and return functional categories with high information content. CRFE excelled in all evaluated criteria, outperforming the existing methods, each of which exhibits deficiencies in at least one aspect. Using lung adenocarcinoma data, we further showed that the functional categories identified by CRFE are biologically meaningful. In conclusion, CRFE computes an informative set of functional categories that summarizes genome-wide expression data. With its superior performance over existing methods, CRFE harbors great promise to become a widely used functional enrichment method. Author summary Given a list of differentially expressed genes as input, functional enrichment methods reveal which functional categories (e.g., biological processes) were likely activated by the cell and are responsible for the differential expression. We developed a new such method, called Concise Ranked Functional Enrichment (CRFE), which addresses the limitations of current approaches by incorporating gene ranking information to compute a concise and specific set of enriched functional categories. Using four treatment-control RNA-seq datasets, we evaluate how well CRFE and the two currently most widely used methods perform in three criteria. We find that CRFE outperforms each of the alternative methods in at least one of the evaluated criteria, demonstrating its superiority. A high-level interpretation of the functional categories identified by CRFE for lung adenocarcinoma datasets highlights its usefulness for experimentalists. Overall, CRFE harnesses the power of ranked gene lists to generate a focused and non-redundant set of enriched functional categories. Our study positions CRFE as a promising method for functional enrichment analysis, with the potential to advance research in this field.

The advent of next-generation sequencing technology has ushered in a deluge of 2 high-throughput data in biological studies [1,2]. Over the past two decades, scientists 3 have become able to conduct comprehensive profiling studies on the genome, epigenome, 4 transcriptome, metabolome, proteome, etc, at increasingly affordable costs [3,4]. These 5 omics studies typically generate extensive lists of biological components, such as genes, 6 proteins, and metabolites, which can be ranked based on their abundance, fold change, 7 or other relevant attribute. The interpretation of such data presents a fundamental and 8 challenging bioinformatics task [5][6][7]. Functional enrichment methods are usually 9 employed to make sense of these long lists. In the most common use case, these 10 methods are applied to long lists of genes, ranked by differential expression values from 11 treatment-control transcriptome study (e.g., RNA-seq). The output typically consists of 12 a selection of functional categories that annotate a surprisingly large number of highly 13 ranked genes [8,9]. The returned functional categories provide a biological explanation 14 of the gene expression changes, facilitating a more comprehensive understanding of the 15 underlying biology. In this study, we adhere to this most common use case. Functional 16 enrichment methods are, however, versatile and applicable to other omics data [7,[10][11][12]. 17 Functional enrichment methods require an annotation database such as the Gene and global (category-level) statistics, determining significance, and adjusting for 45 multiple testing [18]. Users have the freedom to choose methods for each step based on 46 their data type and prior analysis, but this flexibility can lead to inconsistent outputs, 47 causing different users starting with the same expression data to make entirely different 48 discoveries. Although the derivatives exist to address specific GSEA issues, the lack of a 49 comprehensive benchmark and a universally recognized standard for their usage remains 50 unresolved [5,27]. 51 Furthermore, GSEA-based methods, along with ORA or any other methods that regard 52 each functional category individually require a multiple testing correction. This 53 correction can lead to a limited number of functional categories that are considered 54 significantly enriched [26]. The assumption of gene and gene set independence also fails 55 to consider the overlap among categories that arises from the hierarchical structure of categories [24,28,29]. REVIGO works by clustering highly similar GO categories based 62 on semantic similarity measures; the categories that remain in the list are cluster 63 representatives. These methods can be applied post-hoc to the output of functional 64 enrichment methods to avoid redundant results. 65 This review of the existing literature highlights that an ideal functional enrichment 66 method for ranked gene lists should fulfill three key criteria. Firstly, the method should 67 be able to identify functional categories that annotate a substantial number of 68 differentially expressed genes, with a preference for highly ranked genes. This ensures 69 that the method captures the most relevant biological insights by focusing on genes that 70 exhibit significant changes in expression. Secondly, the method should provide a 71 non-redundant set of categories, meaning that the identified categories should share 72 hardly any gene annotations. This avoids repetitive information and allows for a more 73 concise and focused interpretation of the results. Lastly, it is advantageous for the 74 method to prioritize specific categories over those with a large number of gene 75 annotations. Categories with fewer annotations tend to be more precisely defined and 76 offer more informative insights into the underlying biological processes. This preference 77 for specificity ensures that the method highlights well-defined biological processes that 78 are directly associated with the differentially expressed genes. 79 Model-based Gene Set Analysis (MGSA) is one of a few functional enrichment methods 80 that consider sets of functional categories, rather than each category individually [30].

81
By finding the set of functional categories that collectively best explains a given 82 expression dataset, MGSA manages to return a smaller non-redundant set of functional 83 categories. MGSA operates, however, only in binary mode and thus fails to leverage the 84 valuable information provided by gene ranking. To overcome this limitation, we propose 85 a novel method called Concise Ranked Functional Enrichment (CRFE). As MGSA, 86 CRFE classifies the genes into perturbed and unperturbed based on a user-defined 87 threshold. Contrary to MGSA, CRFE considers the specific ranking of all perturbed 88 genes. It returns a concise and non-redundant set of functional categories, which 89 prioritizes the inclusion of highly ranked genes among the returned categories.

90
In this manuscript, we introduce CRFE, and using four treatment-control RNA-seq 91 datasets and three functional enrichment performance metrics, we assess the 92 effectiveness of CRFE compared to the two widely used methods, ORA and GSEA, as 93 well as MGSA, which CRFE generalizes. We show that CRFE performs at least as well 94 as each other method in all three metrics, and strictly better than each other method in 95 at least one metric. We conclude with a brief biological interpretation of the top 96 functional categories returned by CRFE for two human lung adenocarcinoma RNA-seq 97 datasets, demonstrating the practical utility and effectiveness of CRFE in deciphering 98 the biological implications of gene expression data.

100
Generative model 101 We employ a generative model, as in MGSA [30], that explains how biological 102 experiments yield lists of thousands of genes, proteins, etc, ranked by some measured 103 value. For ease of exposition, we describe the model in the context of a 104 treatment-control RNA-seq experiment, which is the most frequent use case of 105 functional enrichment methods. Here, the genes are ranked by differential expression 106 value (e.g., fold change) between treatment and control. We model the response of genes 107 in the experiment as the result of the activation of several functional categories due to 108 the treatment. 109 We assume that the experiment seeks to detect gene states based on differential 110 expression, and that a gene can be in one of two general states, "perturbed," or 111 "unperturbed". The true state of every gene is hidden. We model the results of the 112 experiment by assuming that, in the absence of any noise, every gene that is in the 113 "perturbed" state is annotated to at least one activated functional category. However, 114 since experimental data contains noise, we further assume that unknown false positive 115 and false negative rates determine the observed state of every gene. 116 We now describe the generative model formally (see Figure 1 for an illustration). We 117 use biological processes in the Gene Ontology to represent functional categories. Let C 118 denote the set of all biological processes. In response to the treatment, a cell activates 119 an unknown proportion q ∈ (0, 1) of these processes independently. We denote the set of 120 all activated processes by C ⊂ C.

121
Genes annotated to the activated processes exhibit varying degrees of expression: some 122 genes are highly expressed due to the treatment, and are therefore more likely to exhibit 123 a high differential expression value. These genes are observed perturbed; other 124 annotated genes may display no difference in expression levels and are observed 125 unperturbed. This difference in expression gives rise to a stratification of the genes into 126 two sets: P , the perturbed genes, and U , the unperturbed genes. We assume that genes 127 annotated to an activated process are observed unperturbed with probability β, the false 128 negative rate. Similarly due to random noise, we assume that all genes not annotated to 129 any activated process are observed perturbed with probability α, the false positive rate. 130 We further assume that α and β are identical and independent for all genes.

131
This generative model can also be represented as a Bayesian network, wherein the sets 132 C and C − C constitute a hidden layer, and the gene sets P and U form the observed 133 layer. The hidden layer is conditioned on the parameter q, while the observed layer is 134 conditioned on the hidden layer as well as the parameters α and β. Illustration of the generative model. In response to the treatment, the cell activates a proportion q of biological processes. Genes annotated to the activated processes are observed perturbed or unperturbed in a differential expression analysis, with true positive rate 1 − β and false negative rate β, respectively. Likewise, genes not annotated to any activated processes are observed perturbed or unperturbed with false positive rate α, and true negative rate 1 − α, respectively. For this specific illustration, we use the values α = 1/5, β = 1/3, and q = 5/9. CRFE aims to recover the set of activated functional categories C that gave rise to the 137 observed ranked gene list, which can be divided into perturbed and unperturbed genes 138 based on a user-defined ranking cutoff τ . Common cutoffs include a 2-fold or 1.5-fold 139 change in expression values. In an ideal scenario, CRFE finds a set of categories C such 140 that all perturbed genes would be annotated to at least one category c ∈ C, while none 141 of the unperturbed genes would be annotated to any category c ∈ C. However due to 142 noise, this ideal solution is typically unachievable. We thus need to find a C that best 143 explains the observed gene list and define what "best" means.

144
For a given C, let E(C) be the set of genes annotated to at least one category c ∈ C. 145 We call elements in E(C) explained genes, and say that C explains the set of genes  Higher-ranked perturbed genes are more likely to represent true perturbations than 149 genes just above the ranking cutoff τ . It is therefore more important to explain highly 150 ranked genes. To achieve this emphasis, we divided the set P of perturbed genes into k 151 equally-sized subsets, P 1 , . . . , P k , based on the ranking and introduce subset-specific 152 weights w i later. Genes in P i are higher ranked (i.e., more differentially expressed) than 153 those in P i+1 . For computational efficiency and to prevent the loss of even marginal 154 information provided by the ranking, we considered k = |P |, i.e., each perturbed gene 155 forms its own subset. However, in theory, any value k ∈ 1, 2, . . . , |P | is feasible, with 156 k = 1 corresponding to the MGSA approach.
(i) EP i = E(C) ∩ P i contains the perturbed genes in subset P i that are annotated to 159 at least one category c ∈ C, 160 (ii) N P i = N (C) ∩ P i contains the perturbed genes in subset P i that are not 161 annotated to any category c ∈ C, 162 (iii) EU = E(C) ∩ U contains the unperturbed genes that are annotated to at least 163 one category c ∈ C, and 164 (iv) N U = N (C) ∩ U contains the unperturbed genes that are not annotated to any 165 category c ∈ C.

166
Each of these gene sets is a function of C. To simplify notation, we will omit the 167 dependence on C.

168
The joint probability of observing a ranked list of genes P = (P 1 , . . . , P k ), U given a set 169 C of activated categories, as well as parameters α, β, and q is then Since 0 < q << 1, the parameter q penalizes increases in the size of C. This joint probability is also the likelihood of (C, α, β, q) given the ranked gene list, denoted as L(C, α, β, q|P, U ). MGSA maximizes this likelihood function. With P split into P i , i = 1, . . . , k, we can rewrite Eq 1 as which can be modified by including subset-specific weights, enabling CRFE to put more 171 emphasis on explaining highly-perturbed genes. (i.e., L increases) to find a C that explains perturbed genes. To increase the importance 176 of explaining highly ranked perturbed genes, we introduced weights to replace 1 − β by 177 1 − βw i and α by αw i for genes in P i . Requiring w i < w i+1 ensures it is more important 178 to explain a higher-ranked gene. The likelihood function with these weights is We designed the weights w i with specific properties to facilitate the interpretation and 180 analysis ( Figure 2). 181 1. We assumed the weights increase at a fixed linear rate. That is, 2. We introduced a hyper-parameter b, which describes the difference between the 183 highest and lowest weight, June 30, 2023 6/27 If b = 1, all perturbed subsets are assigned equal weights, which is equivalent to 185 the approach used in MGSA. If b > 1, the likelihood decreases more if a highly 186 ranked perturbed gene is not explained. We call b the belief parameter, as the 187 choice of a higher value for b intuitively translates to an experimentalist having 188 greater confidence ("belief") that the genes measured to be highly differentially 189 expressed are the most relevant to the experiment. 190 3. We centered the weights at 1 to achieve w i ∈ (0, 2) and ultimately, These conditions lead to a linear system with a unique solution (w 1 , . . . , w k ) given by If each perturbed gene forms its own subset (i.e., k = |P |, |P i | = 1), as used in our 194 implementation of CRFE, this simplifies to When learning the parameters in the Markov chain Monte Carlo (MCMC) process (defined below), the possible range for α and β needs to be restricted in order to reward CRFE for explaining any perturbed gene. That implies In the MCMC process, α, β are independently updated within the range [0, r max ]. We thus require If k = |P |, this yields r max ≤ b+1 4b , which approaches 1 4 for very large choices of the To prioritize the explanation of highly perturbed genes by CRFE, we designed the weights such that the likelihood function is "rewarded more" if a highly perturbed gene is annotated to at least one functional category in the explanatory set C than if a less perturbed gene is explained by C. Note that even for the least perturbed gene (in P k ) the likelihood is higher if it is explained (i.e., if it is in EP k ). This holds as long as α, β < b+1 4b . Let θ = (C, α, β, q) and let Θ = {0, 1} C × A × B × Q be the solution space of θ where 199 A, B, Q contain all possible values of α, β, q, respectively. Finding a set C * of functional 200 categories that "best" explains the data reduces to maximizing the likelihood function 201 (Eq 2). That is, Even for a fixed choice of α, β, q, the solution space Θ has a size of 2 |C| , which is too 203 large to be explored exhaustively. Therefore, we employed a Markov Chain Monte Carlo 204 (MCMC) method to estimate C * and the parameters (α, β, q). For convenience, we 205 maximized the log-likelihood function, We initialized the set of activated processes as the empty set, C 0 = ∅. We further set 207 α = 0.1, β = 0.25, q = 1/|C| (since we learn α, β, the initial choices do not matter).

208
During each iteration j ≥ 1, we generated a new proposal set θ ′ j by one of the following 209 three actions: 210 1. with a probability of 80%, we updated C. That is, 211 (a) we added an unselected functional category (from C − C j−1 ) to the current 212 set C j−1 , or 213 (b) we deleted a category from the current set C j−1 , or 214 (c) we swapped a category in C j−1 with an unselected category. 215 2. with a probability of 10%, we updated α. 216 3. with a probability of 10%, we updated β.

217
As we ran the Markov chain for many iterations (i.e., until sufficient convergence had 218 occurred), the specific probabilities assigned to each action did not matter.

219
There are a total of |C| + |C j−1 ||C − C j−1 | possible proposals for the first action. We ensured that the penalty for adding categories remained sufficiently high. 226 We accepted the proposed change with probability Otherwise, the proposal was rejected and θ j remained unchanged. In other words, if the 228 likelihood of the proposed set was higher than the likelihood of the current set, the 229 proposal was always accepted. However, if the likelihood of the proposed set was lower, 230 we still accepted the proposal with a positive probability, which is inversely proportional 231 to the difference in likelihoods. This allowed the search process to occasionally explore 232 properties ensure convergence to a stationary distribution that can be sampled from, 236 providing us with the inferred set of activated functional categories.

237
To allow the Markov chain to reach a stationary distribution, we incorporated a burn-in 238 period of m = 50, 000 iterations, after which we recorded the current set θ j at each 239 iteration for an additional n = 50, 000 iterations. To determine the posterior probability 240 of a functional category c ∈ C, we calculated the fraction of recorded steps in which the 241 current set C j contained c. Mathematically, the posterior probability of c ∈ C is defined 242 as where C j−m represents the jth set recorded after the burn-in period. We ranked all Ontology (GO) Consortium in December 2022 [31], which records all leaf annotations in 251 the biological process aspect of human genes. The annotations were then propagated 252 using the Gene Ontology structure stored in an OBO file (containing the relationships 253 between GO categories) to respect the true path rule, i.e., any gene annotated to a GO 254 category is also annotated to the ancestors of that GO category. Next, we combined all 255 biological processes which annotate exactly the same genes (e.g., "growth" and 256 "developmental growth" are combined into " growth & developmental growth") while 257 other processes were retained. We did this to improve the performance and consistency 258 of the MCMC process. This way, the Markov chain is not forced to choose one of 259 possibly many equal categories arbitrarily (and assign potentially a high posterior 260 probability) and discard the others (and assign a low posterior probability).

261
Consequently, an annotation file, including combined GO categories and genes explained 262 by each corresponding category, is used as inputs to the functional enrichment methods. 263 For the first two datasets, we performed differential gene expression analysis using 284 DESeq2 [35] to get a list of genes ranked by log2-fold change (using normal samples as 285 the baseline). For the two COVID-19 datasets, we obtained the published log2-fold 286 change data and corresponding p-values from ExpressionAtlas [36]. The log2-fold 287 change values were rounded to one decimal place only and therefore exhibited many ties, 288 leading to arbitrariness in the gene order and negative effects on the outcomes of 289 functional enrichment methods. To increase the signal in the data, we broke most ties 290 by ranking by p-values as a secondary argument.

291
Implementing functional enrichment methods

292
Given a datsaset, we removed all genes from the annotation file that were not included 293 in the dataset. Next, we deleted categories that annotated fewer than 20 or more than 294 500 genes included in the dataset. Categories that are too large are not informative, 295 whereas categories that are too small have a substantial chance to be randomly enriched. 296 Whenever such an upper and/or lower cutoff on the categories size is used, the exact 297 composition of the investigated set of categories thus depends on the dataset. Removal 298 of these categories reduces the search space, leading to faster convergence of the MCMC 299 process.

300
Next, we deleted any genes that were not annotated to any categories (typically, none 301 are deleted in this step) and sorted the list of remaining genes in descending order based 302 on their differential expression levels (log2-fold change in the four datasets). This 303 ranked gene list constitutes the main input to the functional enrichment method besides 304 the annotation file. The dataset-specific size-filtering for categories is automatically 305 implemented in CRFE and MGSA, but not for GSEA and ORA; therefore, we manually 306 performed the filtering before running the latter two methods. 307 We used the CRFE software to run MGSA (with belief b = 1) and CRFE (with belief 308 b > 1) and the R package clusterProfiler for GSEA and ORA [19]. Specifically,

309
GSEA and ORA were performed using clusterProfiler::GSEA with fGSEA 310 option [26] and clusterProfiler::enricher, which enable users to provide a custom 311 annotation file. Because ORA is deterministic, we only ran it once for each dataset.

312
While GSEA includes a stochastic permutation test to compute the p-values, its main 313 calculation of the Enrichment Scores is deterministic. Repeated GSEA runs showed 314 minimal differences in the order of enriched categories and p-values; therefore, we only 315 ran GSEA once for each dataset. In contrast, the MCMC process underlying MGSA 316 and CRFE is inherently stochastic, we therefore repeated these analyses 100 times.

317
Except for GSEA, all methods require a threshold to divide the gene list into perturbed 318 and unperturbed genes. Rather than picking a specific differential expression threshold 319 τ , we chose τ such that, for each dataset, 30% of all genes were perturbed, and used 320 10% and 20% in sensitivity analyses. Depending on the dataset, the default threshold of 321 30% corresponds to an expression fold change threshold τ of 1.44-1.62, which is close to 322 the frequently used value of 1.5 [37,38] (Table 1). In addition, we post-processed the results of GSEA and ORA using REVIGO, as these 337 two methods are prone to return many overlapping categories [28]. We used the 338 REVIGO web interface with Homo sapiens Gene Ontology option at a medium allowed 339 similarity (default) and SimRel as semantic similarity measure (default), and obsolete 340 categories were not replaced. REVIGO-processed results contained fewer categories as 341 each group of overlapping categories was collapsed into one representative category; we 342 sorted the collapsed categories using the same ranking (p-value or NES) as the original 343 list. Generally, functional categories with fewer gene annotations are more informative [15]. 355 Therefore, a good method should preferably return specific categories. The biological Functional enrichment methods seek to find a set of functional categories that annotate 362 as many perturbed genes and as few unperturbed genes as possible. These two goals 363 conflict with each: an improvement in one usually leads to a deterioration in the other. 364 To quantify this trade-off, we defined a new measure, which we call quality. Given a set 365 C of categories and a subset X of the perturbed genes P , the quality Q(C, X) of C with 366 respect to X is the ratio of (a) the proportion of genes in X that are explained by C 367 and (b) the proportion of unperturbed genes that are explained by C. That is, By defining the quality for any subset of the perturbed genes P , we can ask how it 369 varies for different prefixes of the perturbed gene list (i.e., the top x% of P ). If C = C, 370 the quality is 1 as all genes are explained. If C does not explain any unperturbed genes, 371 the quality equals infinity although for |C| ≥ 20 and with |U | = 70% of all genes, this 372 never occurred in our study.

373
Given a ranked list of categories returned by a functional enrichment method, we 374 created a diagnostic plot, the quality plot, as follows. Starting with the highest-ranked 375 category as C, we plotted the quality Q(C, X) against the proportion of genes in X 376 that are explained by C, |E(C) ∩ X|/|X|, and repeated this as we added categories to 377 C one at a time. We extrapolated each quality curve to the left boundary (i.e., 0 genes 378 explained) with the same quality score as the first returned category and linearly 379 interpolated to the right boundary at (1, 1), which is achieved when C = C. We used Contrary to standard diagnostic plots such as the receiver operating characteristic or 384 the precision-recall plot, a subset of perturbed genes P − X is not taken into account in 385 the calculation of Q(C, X), unless X = P . This is so that the quality, when computed 386 for different X, can highlight how well a method explains the highly perturbed genes 387 without affecting the quality scores if these methods capture less perturbed genes.

389
Using the biological process aspect of the Gene Ontology and four published RNA-seq 390 treatment-control differential gene expression datasets, we computed three metrics to (corresponding to perturbed genes exhibiting a fold-change of 1.44 or higher, Table 1). 407 For GSEA, we ranked the enriched biological processes both by p-value (as frequently 408 done by users) and by their normalized enrichment score (NES, which we show is 409 superior). Lastly, we only considered GO biological processes that annotate between 410 20 − 500 genes included in the expression data.  Figure 3). Even after a Benjamini-Hochberg correction for multiple testing, GSEA and 425 ORA were on average less selective than CRFE.
426 Among the two model-based methods, CRFE appears even more selective. Across the 427 four datasets, MGSA consistently returned the most categories with an average 428 posterior higher than 1%, followed by CRFE b = 2; CRFE b = 5 and b = 10, which 429 returned roughly the same and lowest number of categories (Figure 3). This is likely 430 due to the specifics of the MCMC process. To avoid getting trapped in a local optimum, 431 the MCMC process accepts any proposal (e.g., the addition of any category) with a 432 positive probability. However, the acceptance probability (Eq 3) decreases as the belief 433 parameter increases, rendering the MCMC process more restrictive. Interestingly, the number of categories with an average posterior higher than 10% (lighter bars of MGSA 435 and CRFE in Figure 3) did not vary substantially when changing the belief parameter 436 from 1 (MGSA) to 10. Categories with an average posterior of 10% or higher are likely 437 truly enriched for perturbed genes, as a randomly added category is typically long 438 removed before it accumulates 10% average posterior. Differences in the acceptance 439 probability due to the belief parameter are small enough to barely affect categories with 440 a true signal.

441
Researchers analyzing functional enrichment results often face long lists of functional 442 categories that possess highly similar gene annotations due to the hierarchical structure 443 of annotation databases such as the GO. These redundant lists are difficult to interpret 444 and inflate the number of biologically meaningful results [28]. The average redundancy 445 of a set of categories is measured by the average Jaccard similarity of gene annotations 446 between all pairs of categories within the set. The higher the mean pairwise Jaccard 447 similarity, the higher the average redundancy, and the more redundant a set of 448 categories is. For all four datasets, the set of all biological processes, which annotate 449 between 20 and 500 genes included in the dataset, is highly similar. The overall average 450 redundancy of this entire set (i.e., the expected redundancy when randomly picking two 451 categories) is 0.007-0.008 across all datasets.  predominantly returned larger categories, likely due to the well-known fact that the 477 p-value decreases when "sample size" increases, even when the "effect size" remains 478 fixed [17,39]. The EENADCL dataset formed an exception in that all methods except 479 for ORA, even GSEA p-value, returned quite specific categories. Post-hoc processing by 480 REVIGO did not substantially alter the number of genes annotated to the top returned 481 categories. This makes sense since REVIGO does not consider category size when 482 selecting a representative category for a set of highly similar categories.

483
Quality 484 Functional enrichment methods should explain highly perturbed genes particularly well, 485 as these genes are the most likely ones to be annotated to a functional category that 486 was activated by the cell in response to some perturbation such as the development of 487 cancer. Each method ranks the returned biological processes by some metric (e.g.,

488
p-value, average posterior). To compare how well each method explains the 489 highest-ranked genes, we varied both the number r of categories included in the 490 explanatory set and the percentage x% of genes considered highly ranked. As we varied 491 r, we plotted the quality of the explanatory set against the proportion of top x% genes 492 that was explained by the r categories. Figure 6A displays a representative quality plot 493 for x = 15%. Recall that for the threshold-based methods, ORA, MGSA and CRFE, we 494 considered 30% of all genes perturbed, corresponding to perturbed genes having a fold 495 change of 1.45 or higher (  Compared to MGSA, which does not take the ranking of perturbed genes into account, 504 CRFE clearly succeeded in finding sets of categories that annotate more highly ranked 505 genes at high quality ( Figure 6A). The quality of ORA and GSEA, which consider each 506 category individually, quickly dropped although these methods at least partially 507 succeeded at finding initial categories with very high quality. For example, the first 508 biological process returned by ORA+REVIGO is DNA-templated DNA replication 509 with a quality of 6.6. This category is also selected by all other methods but not as the 510 first one in the enriched list. Both MGSA and CRFE select this category within the 511 first ten categories, with average posterior probabilities of 0.66 for MGSA and about 512 0.85 for CRFE, irrespective of the specific choice of tested belief parameter.

513
Next, we tested the sensitivity of these general trends by varying x, the percentage of 514 genes considered positive, from 3% (i.e., 10% of all genes that are considered perturbed 515 by ORA, MGSA, and CRFE) to 30% (i.e., 100%, respectively). To summarize the 516 performance described by each quality plot, we computed the area under the curve 517 (AUC). CRFE at b = 5 and b = 10 performed equally well ( Figure 6B). However even at 518 x = 30%, these two methods outperformed all other methods. For lower x, the 519 difference in AUC values increased substantially, clearly highlighting that CRFE excels 520 at explaining highly ranked or most perturbed genes. Furthermore, GSEA ranked by 521 NES outperformed GSEA ranked by p-value. This is likely due to the preference for 522 large categories when ranking categories by p-value, through which several undesired 523 unperturbed genes are explained, lowering the quality. Notably, all methods explained 524 the top ranked genes at highest quality, even ORA and MGSA, which do not consider 525 the ranking of the perturbed genes. This provides evidence that one of the assumptions 526 made by CRFE, namely that false positives appear more frequently at the lower end of 527 the perturbed genes, is correct.

545
Interpretation of enriched category sets 546 We performed a high-level interpretation of the results of CRFE for the two lung 547 adenocarcinoma datasets. In our analysis we only considered positively differentially 548 expressed genes as perturbed; therefore, each of the top 20 enriched biological processes 549 reported in S1 Table and S2 Table represents activation in response to cancer. As 550 expected, the set of the enriched processes for the two datasets is similar, indicating a 551 consistent pattern of gene expression changes associated with lung adenocarcinoma.

552
Most of the ten hallmarks of cancer appeared among the top enriched biological 553 processes [40]. For instance, the enriched category intermediate filament 554 organization alone is associated with eight hallmarks [41]. Categories such as 555 collagen metabolic process, alpha-amino acid biosynthetic process, and 556 nucleoside diphosphate metabolic process are relevant to the hallmarks of 557 "Evading immune destruction" and "Activating invasion and metastasis" [42][43][44].

566
GSEA and ORA-based approaches constitute the most commonly used methods [6].

567
These methods regard each functional category individually, necessitating a false 568 discovery rate control step that often eliminates a significant portion of the results. method CRFE, which generalizes MGSA by incorporating the specific ranking of genes, 574 overcomes all these limitations.

575
CRFE contains two important hyperparameters: the belief and the proportion/threshold 576 of perturbed genes. A higher belief parameter leads to a higher quality score when 577 focusing on the small subset of most perturbed genes (Figures 6, 7). When the focus is 578 on a larger proportion of perturbed genes, a lower belief parameter should be used.

579
Furthermore, a higher belief parameter yielded fewer enriched functional categories 580 because the conditions imposed on acceptance of a category in the MCMC process 581 became more restrictive (Table 3). Therefore, the choice of the belief parameter should 582 depend on what proportion of perturbed genes a user wants to explain with a high 583 quality. Experimalists often set the second hyperparameter, the threshold for perturbed 584 genes, at expression fold changes of 1.5 or 2. When given the choice, we generally 585 encourage users to choose looser thresholds since genes with a mediocre fold change still 586 contain valuable information. Rather than choosing a specific fold change threshold, 587 users can also simply consider a fixed proportion of all genes perturbed, as we did here. 588 In evaluating the performance of functional enrichment methods in this study, we made 589 an interesting observation about GSEA. Ranking the enriched functional categories by 590 p-value versus NES value yielded on average larger and more redundant categories, 591 which explained the perturbed genes at a lower quality. This is likely due to the bias of 592 the p-value towards larger categories due to their increased statistical power. On the 593 other hand, the NES value is not biased towards larger categories.

594
In conclusion, we showed that CRFE consistently outperforms the other methods when 595 considering three key criteria. While all other methods exhibited poor performance in at 596 least one criterion, CRFE returns a concise and specific set of functional categories that 597 effectively explains highly perturbed genes with high quality. Using the adenocarcinoma 598 data, we highlighted that the biological processes enriched by CRFE are biologically