PT - JOURNAL ARTICLE
AU - Yue Li
AU - Si Yi Li
AU - Wenmin Zhang
AU - Tianyi Liu
TI - Partitioning gene-based variance of complex traits by gene score regression
AID - 10.1101/2020.01.08.899260
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.01.08.899260
4099 - http://biorxiv.org/content/early/2020/01/09/2020.01.08.899260.short
4100 - http://biorxiv.org/content/early/2020/01/09/2020.01.08.899260.full
AB - Motivation: Understanding the biological mechanism of complex phenotypes is challenging due to the lack of efficient approaches that can associate the vast majority of genome-wide association study (GWAS) loci in the non-coding regions of the human genome with the relevant genes and, ultimately, the downstream pathways. Transcriptome-wide association studies (TWAS) provide a way to associate genes with a phenotype of interest by correlating GWAS summary statistics with expression quantitative trait loci (eQTL) summary statistics obtained from a reference panel. However, genes whose predicted expression is correlated with that of causal genes may exhibit high TWAS statistics even though they are not themselves causal for the trait. Existing gene set enrichment analyses assume independence of genes and may therefore lead to false discoveries. Results: We propose a novel statistical method called Gene Score Regression (GSR). The rationale of GSR is based on the insight that genes that are highly correlated with the causal genes in the causal gene sets or pathways will exhibit high marginal TWAS statistics. Consequently, by regressing the genes’ marginal statistics on the sum of the gene-gene correlation scores in each gene set, we can assess the amount of phenotypic variance explained by the predicted expression of the genes in that gene set. Our approach not only operates on summary statistics without requiring individual-level genotype and phenotype data but can also work with observed gene expression and phenotype data without the need for genotypes.
In simulations, GSR demonstrates greater power and a better-controlled type I error rate than existing methods. We applied GSR to investigate 28 complex traits using diverse genome-wide and knowledge-based gene sets, such as REACTOME and KEGG from MSigDB, as well as 205 cell-type-specific gene sets derived from observed gene expression. The significant gene sets detected by GSR are by and large consistent with the known biology of the traits. We also demonstrated the utility of GSR on cancer data where only gene expression and tumor/normal tissue labels are available. Overall, GSR is not only accurate but also efficient, usually taking less than 5 minutes to perform the full analysis on one phenotype, making it a useful addition to bioinformatic analysis pipelines. Availability: The GSR software is available on GitHub at https://github.com/li-lab-mcgill/GSR.
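
The regression idea described in the abstract can be illustrated with a minimal Python sketch. It is not the GSR software. It makes several assumptions that the abstract does not state: that a gene score is the sum of squared gene-gene correlations over the members of a gene set, that the per-gene response is a chi-square-like marginal statistic, and that the fit is ordinary least squares with an intercept. The function names `gene_scores` and `fit_gene_score_regression` and the synthetic data are hypothetical.

```python
import numpy as np


def gene_scores(corr, membership):
    """Sum squared gene-gene correlations over the members of each gene set.

    corr:       (G, G) gene-gene correlation matrix of expression
    membership: (G, C) 0/1 matrix, membership[k, c] = 1 if gene k is in set c
    returns:    (G, C) matrix with entry [g, c] = sum over k in set c of corr[g, k]**2
    Assumption: squared correlations are used, by analogy with LD scores.
    """
    return (corr ** 2) @ membership


def fit_gene_score_regression(stats, scores):
    """Ordinary least squares of per-gene statistics on gene scores.

    Returns the intercept and one slope per gene set.
    Assumption: plain OLS with an intercept; no weighting or standard errors.
    """
    design = np.column_stack([np.ones(len(stats)), scores])
    coef, *_ = np.linalg.lstsq(design, stats, rcond=None)
    return coef[0], coef[1:]


# Synthetic toy data: 200 samples, 40 genes, 2 gene sets (all hypothetical).
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 40))
corr = np.corrcoef(expr, rowvar=False)

membership = np.zeros((40, 2))
membership[:10, 0] = 1      # gene set 0: genes 0-9
membership[10:25, 1] = 1    # gene set 1: genes 10-24

scores = gene_scores(corr, membership)

# Noiseless statistics in which only gene set 0 contributes.
true_slopes = np.array([0.5, 0.0])
stats = 1.0 + scores @ true_slopes

intercept, slopes = fit_gene_score_regression(stats, scores)
print("intercept:", intercept)
print("slopes per gene set:", slopes)
```

In this toy setting the statistics are constructed without noise, so the fit recovers the slopes used to generate them. With real TWAS statistics, the slope for a gene set would be read as a rough indication of how much that set contributes, and a proper analysis would also need standard errors and significance tests, which this sketch omits.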