Abstract
Motivation Canonical methods for gene-set enrichment analysis assume independence between gene-sets. In practice, heterogeneous gene-sets from diverse sources are frequently combined, resulting in gene-sets that share genes. Such overlaps violate the independence assumption, compromise statistical modelling, and complicate interpretation of results.
Results We rephrase gene-set enrichment as a regression problem. Given some genes of interest (e.g. a list of hits from an experiment) and gene-sets (e.g. functional annotations or pathways), we aim to identify a sparse list of gene-sets that describes the genes of interest. In a regression framework, this amounts to identifying a minimal collection of gene-sets whose membership optimally predicts whether a given gene belongs to the genes of interest. To accommodate redundancy between gene-sets, we propose regularized regression techniques such as the elastic net. We report that regression-based results are consistent with those of established gene-set enrichment methods but are more parsimonious and interpretable.
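The regression view described above can be illustrated with a minimal sketch. It is not the gerr implementation and does not use its API: the toy genes, the three gene-sets, the penalty values (`lam`, `alpha`), the squared-error loss, and the hand-written coordinate-descent solver are all assumptions chosen for illustration. Each gene is one observation, each gene-set is one binary predictor (1 if the gene is a member), and the response is 1 if the gene is among the genes of interest.

```python
# Toy illustration only: all names and values below are hypothetical.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
gene_sets = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3", "g4", "g5"},  # overlaps heavily with A
    "C": {"g6", "g7"},                    # unrelated to the genes of interest
}
genes_of_interest = {"g1", "g2", "g3", "g4"}

set_names = sorted(gene_sets)
# Design matrix: rows are genes, columns are gene-set membership indicators.
X = [[1.0 if g in gene_sets[s] else 0.0 for s in set_names] for g in genes]
# Response: 1 if the gene is a gene of interest, else 0.
y = [1.0 if g in genes_of_interest else 0.0 for g in genes]


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, lam=0.05, alpha=0.5, sweeps=1000):
    """Coordinate descent for the objective
    (1/2n) * ||y - b0 - X b||^2 + lam * (alpha * |b|_1 + (1 - alpha)/2 * |b|_2^2).
    The intercept b0 is not penalized.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = 0.0
    for _ in range(sweeps):
        # Intercept update: mean of the residual without the intercept.
        b0 = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                X[i][j]
                * (y[i] - b0 - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            denom = sum(X[i][j] ** 2 for i in range(n)) / n + lam * (1 - alpha)
            beta[j] = soft_threshold(rho, lam * alpha) / denom
    return b0, beta


intercept, beta = elastic_net(X, y)
coef = dict(zip(set_names, beta))
for name in set_names:
    print(name, round(coef[name], 3))
```

In this toy setting the gene-set that matches the genes of interest exactly (A) receives the largest coefficient, the redundant superset (B) is shrunk strongly, and the unrelated set (C) is dropped with a coefficient of exactly zero. Gene-sets with non-zero coefficients form the reported sparse list.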
Availability We implement the model in gerr (gene-set enrichment with regularized regression), an R package freely available at https://github.com/TaoDFang/gerr and submitted to Bioconductor. Code and data required to reproduce the results of this study are available at https://github.com/TaoDFang/GeneModuleAnnotationPaper.
Contact Jitao David Zhang (jitao_david.zhang@roche.com), Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Grenzacherstrasse 124, 4070 Basel, Switzerland.
Footnotes
We have added details on the simulation study with GO gene-sets and made other modifications, including to the references and the discussion. The methodology and software described remain largely unchanged.