Abstract
Motivation Canonical methods for gene-set enrichment analysis assume independence between gene-sets. This assumption may be reasonable when redundancy is low, but it breaks down when gene-sets overlap or are redundant with each other. In practice, heterogeneous gene-sets from different sources are often used together, so the hit gene-sets partially or fully overlap, which compromises statistical modelling and complicates the interpretation of results.
Results We rephrase gene-set enrichment as a regression problem by treating genes-of-interest membership as a binary target variable, and gene-set membership as binary independent (predictor) variables. The goal is to identify a minimal set of gene-sets that best predicts whether or not a gene belongs to a set of genes of interest. To accommodate redundancy between gene-sets, we propose to solve the problem with regularized regression techniques such as the elastic net. We found that regression-based results are consistent with those of established methods, but much sparser and therefore more interpretable.
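The regression framing described in Results can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the gerr implementation: the helper names (`build_design`, `fit_elastic_net_logistic`), the toy gene identifiers, the gene-set names, and the values of `lam`, `alpha`, `lr` and `n_iter` are all assumptions. The sketch encodes genes-of-interest membership as a binary target `y`, gene-set membership as a binary predictor matrix `X`, and fits a logistic regression with an elastic-net penalty by proximal gradient descent.

```python
import math


def build_design(universe, gene_sets, genes_of_interest):
    """Encode enrichment as regression inputs (hypothetical helper).

    y[i] is 1 if gene i is a gene of interest, else 0.
    X[i][j] is 1 if gene i belongs to gene-set j, else 0.
    """
    names = sorted(gene_sets)
    goi = set(genes_of_interest)
    y = [1 if g in goi else 0 for g in universe]
    X = [[1 if g in gene_sets[n] else 0 for n in names] for g in universe]
    return y, X, names


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v towards zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_elastic_net_logistic(X, y, lam=0.05, alpha=0.5, lr=0.5, n_iter=2000):
    """Minimise mean log-loss + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2).

    Plain proximal gradient descent; the intercept is not penalised.
    The step size lr is an assumption that suits this small toy problem.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Predicted probability that each gene is a gene of interest.
        probs = [
            1.0 / (1.0 + math.exp(-(b0 + sum(b * x for b, x in zip(beta, row)))))
            for row in X
        ]
        resid = [pr - yi for pr, yi in zip(probs, y)]
        # Gradient of the smooth part (log-loss plus ridge term).
        grad0 = sum(resid) / n
        grad = [
            sum(r * row[j] for r, row in zip(resid, X)) / n
            + lam * (1.0 - alpha) * beta[j]
            for j in range(p)
        ]
        b0 -= lr * grad0
        # Gradient step followed by soft-thresholding for the L1 term.
        beta = [
            soft_threshold(beta[j] - lr * grad[j], lr * lam * alpha)
            for j in range(p)
        ]
    return b0, beta


# Toy data (entirely made up): 12 genes, two identical gene-sets and one unrelated set.
universe = ["g%d" % i for i in range(1, 13)]
gene_sets = {
    "A": {"g1", "g2", "g3", "g4"},
    "A_dup": {"g1", "g2", "g3", "g4"},  # fully redundant with "A"
    "B": {"g5", "g6", "g7", "g8"},
}
genes_of_interest = ["g1", "g2", "g3", "g4"]

y, X, names = build_design(universe, gene_sets, genes_of_interest)
b0, beta = fit_elastic_net_logistic(X, y)
coef = dict(zip(names, beta))
print(coef)
```

In this toy setting the two identical gene-sets receive equal positive coefficients, while the unrelated set does not receive a positive one. With more weight on the L1 term (larger `alpha`), more coefficients are driven to exactly zero, which is the source of the sparsity mentioned above.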
Availability We implement the model in an R package, gerr (gene-set enrichment with regularized regression), which is freely available at https://github.com/TaoDFang/gerr and has been submitted to Bioconductor. The scripts and the data used in this paper are available at https://github.com/TaoDFang/GeneModuleAnnotationPaper.
Contact Jitao David Zhang (jitao_david.zhang@roche.com), Roche Pharma Research and Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland.