Abstract
More than a decade of genome-wide association studies (GWASs) have identified genetic risk variants that are significantly associated with complex traits. Emerging evidence suggests that the function of trait-associated variants likely acts in a tissue- or cell-type-specific fashion. Yet, it remains challenging to prioritize trait-relevant tissues or cell types to elucidate disease etiology. Here, we present EPIC (cEll tyPe enrIChment), a statistical framework that relates large-scale GWAS summary statistics to cell-type-specific omics measurements from single-cell sequencing. We derive powerful gene-level test statistics for common and rare variants, separately and jointly, and adopt generalized least squares to prioritize trait-relevant tissues or cell types while accounting for the correlation structures both within and between genes. Using enrichment of loci associated with four lipid traits in the liver and enrichment of loci associated with three neurological disorders in the brain as ground truths, we show that EPIC outperforms existing methods. We extend our framework to single-cell transcriptomic data and identify cell types underlying type 2 diabetes and schizophrenia. The enrichment is replicated using independent GWAS and single-cell datasets and further validated using PubMed search and existing bulk case-control testing results.
Introduction
Many years of genome-wide association studies (GWASs) have yielded genetic risk variants associated with complex traits and human diseases. Emerging evidence suggests that the function of trait-associated variants likely acts in a tissue- or cell-type-specific fashion1-5. Recent advances in transcriptomic sequencing, including bulk RNA sequencing (RNA-seq)6 and single-cell RNA sequencing (scRNA-seq)7-9, enable characterization of tissue- and cell-type-specific gene expression. Combining the tissue- or cell-type-specific gene expression profiles with GWAS summary statistics provides a better understanding of genetic regulatory effects with increased resolution10-13. Along this line of research, recent studies have identified specific brain cell types that underlie neuropsychiatric disorders, such as schizophrenia14 and Parkinson’s disease15, revealing that scRNA-seq data can offer finer-resolution insights that help to elucidate disease etiology.
Several methods16-20 have been developed to integrate tissue- or cell-type-specific gene expression profiles with GWAS summary statistics to prioritize trait-relevant tissues and cell types. One set of methods, including RolyPoly16 and LDSC-SEG18, develops models on the single-nucleotide polymorphism (SNP) level and derives SNP-wise annotations from the transcriptomic data. RolyPoly adopts a polygenic model, and the effect sizes of all SNPs associated with a gene have a covariance that is a linear combination of the gene expressions across all tissues or cell types. RolyPoly, therefore, captures the effect of the cell-type-specific gene expression on the covariance of GWAS effect sizes. LDSC-SEG also constructs SNP annotations from tissue- or cell-type-specific gene expressions and then carries out a one-sided test using the stratified LD score regression framework18,21-23. It tests whether trait heritability is enriched in regions surrounding genes that have the highest specific expression in a given tissue or cell type.
Another set of methods, such as CoCoNet19 and MAGMA14,15,17,24, does not model associations at the SNP level. These methods first derive gene-level association statistics, since these pair more naturally with the gene-level expression measurements; they then prioritize risk genes in a specific tissue/cell type. Specifically, CoCoNet models gene-level association statistics as a function of the tissue-specific adjacency matrices inferred from gene expression studies. While CoCoNet is the first method to evaluate the gene co-expression networks, its rank-based method does not allow hypothesis testing due to the strong correlation among gene co-expression patterns constructed from different tissues and cell types. Like CoCoNet, MAGMA17 and MAGMA-based approaches14,15,24 also begin by combining SNP-level GWAS summary statistics into gene-level statistics. This step is followed by a second “gene-property” analysis, where the tissue- and cell-type-specific gene expressions are regressed against the genes’ GWAS test statistics. The various versions of the methods adopt different ways to select genes, transform the outcome and predictor variables, and include different sets of additional covariates14,15,24. While MAGMA-based methods have been successfully used in several studies25-27, Yurko et al.28 examined the statistical foundation of MAGMA and identified an issue: the type I error rate is inflated because the method incorrectly uses Brown’s approximation when combining the SNP-level p-values. In addition to this problem, we noticed that MAGMA’s implementation uses squared correlations between SNPs, which masks the true LD structure.
When modeling on the gene level, one needs to account for the gene-gene correlations. RolyPoly ignores proximal gene correlations but implements a block bootstrapping procedure as a correction. MAGMA approximates the gene-gene correlations as the correlations between the model sum of squares from the second-step gene-property analysis. However, the gene-gene correlation of the effect sizes should be a function of the LD scores (i.e., the correlations between the SNPs within the genes). CoCoNet does not take account of this either, instead using LD information only to calculate the gene-level effect sizes and assuming that gene-gene covariance is a function solely of gene co-expression. A statistically rigorous and computationally efficient method to derive the gene-gene correlation structure while incorporating the SNP-level LD information is needed.
These existing methods either focus on common variants (e.g., RolyPoly and LDSC-SEG) or do not differentiate between common and rare variants (e.g., MAGMA with only summary statistics) due to the limited statistical power for rare variants. While methods for rare-variant association analysis have been developed (e.g., sequence kernel association test29 and burden test30), to the best of our knowledge, no methods are currently available to detect tissue and/or cell-type enrichment of GWAS risk loci using summary statistics for both common and rare variants.
Here, we propose EPIC, a statistical framework to identify trait-relevant tissues or cell types by integrating tissue- or cell-type-specific gene expression profiles and GWAS summary statistics. We adopt gene-based generalized least squares to identify enrichment of risk loci. For the prioritized tissues and cell types, EPIC further carries out a gene-specific influence analysis to identify significant genes. We demonstrate EPIC on multiple tissue-specific bulk RNA-seq and scRNA-seq datasets, along with GWAS summary statistics of four lipid traits, three neuropsychiatric disorders, and type 2 diabetes, and successfully replicate and validate the prioritized tissues and cell types. Together, EPIC’s integrative analysis of cell-type-specific expressions and GWAS polygenic signals helps to elucidate the underlying cell-type-specific disease etiology and prioritize important functional variants. EPIC is compiled as an open-source R package available at https://github.com/rujinwang/EPIC.
Materials and Methods
Overview of methods
The goal of EPIC is to identify disease- or trait-relevant tissues or cell types. An overview of the framework is outlined in Figure 1. EPIC takes as input single-variant summary statistics from GWAS, which are used to aggregate SNP-level associations into genes, and gene expression datasets from either bulk tissue or single-cell RNA-seq. An external reference panel is adopted to account for the linkage disequilibrium (LD) between SNPs and genes. We first perform gene-level testing based on GWAS summary statistics from the single-variant analysis. The multivariate statistics for both common and rare variants can be recovered using the covariance of the single-variant test statistics, which can be estimated from either the participating study or a public database. We then develop a gene-based regression framework that can prioritize trait-relevant cell types from gene-level test statistics and cell-type-specific omics profiles while accounting for gene-gene correlations due to LD. The underlying hypothesis is that if a particular cell type influences a trait, then more of the GWAS polygenic signals would be concentrated in genes with greater cell-type-specific gene expression. For a significantly enriched tissue or cell type, we further carry out a gene-specific influence analysis to identify genes that are highly influential in leading to the significance of the prioritized tissue or cell type.
Gene-level associations for common variants
Let $\beta = (\beta_1, \ldots, \beta_K)^T$ be the effect sizes of K common variants within a gene of interest. Let $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_K)^T$ be the estimators for $\beta$, with corresponding standard errors $\hat{\sigma} = (\hat{\sigma}_1, \ldots, \hat{\sigma}_K)^T$. Let $z = (z_1, \ldots, z_K)^T$ be the z-scores, where $z_j = \hat{\beta}_j / \hat{\sigma}_j$ is the standard-normal statistic for testing the null hypothesis of no association for SNP j. We approximate the correlation matrix of $\hat{\beta}$ (equivalent to the covariance matrix of $z$) by the LD matrix $R = \{R_{jl}; j, l = 1, \ldots, K\}$, where $R_{jl}$ is the Pearson correlation between SNP j and SNP l. We further define $V = \{V_{jl}\} = \{\hat{\sigma}_j \hat{\sigma}_l R_{jl}\}$ as the covariance matrix of $\hat{\beta}$. We have $\hat{\beta} \sim \mathrm{MVN}(0, V)$ under the null. To perform gene-level association testing for common variants, we construct a simple and powerful chi-square statistic for testing the null hypothesis of $\beta = 0$:
$$Q = \hat{\beta}^T V^{-1} \hat{\beta} = z^T R^{-1} z \sim \chi^2_K.$$
The correlation matrix R can be estimated from either the participating study or a publicly available reference panel. In this study, we utilize the 1000 Genomes Project European panel31, which comprises genotypes of ∼500 European individuals across ∼23 million SNPs.
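As a minimal sketch of the common-variant statistic, the snippet below computes $Q = z^T R^{-1} z$ through a Cholesky factor of R using only the Python standard library. The function names (`cholesky`, `forward_solve`, `gene_chisq`) and the toy effect sizes, standard errors, and LD matrix are assumptions made for illustration and are not taken from the EPIC package; R is assumed to be given and well-conditioned.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower-triangular L."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def gene_chisq(beta_hat, se, R):
    """Return (Q, K): the quadratic form z^T R^{-1} z and its degrees of freedom."""
    z = [b / s for b, s in zip(beta_hat, se)]
    x = forward_solve(cholesky(R), z)  # x = L^{-1} z, so x.x = z^T R^{-1} z
    return sum(v * v for v in x), len(z)

# Toy gene with two common SNPs in moderate LD (values are illustrative only).
Q, K = gene_chisq([0.10, 0.06], [0.05, 0.05], [[1.0, 0.5], [0.5, 1.0]])
print(f"Q = {Q:.4f} on {K} degrees of freedom")
```

Under the null, Q would be compared against a chi-square distribution with K degrees of freedom to obtain the gene-level p-value.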
An effective chi-square test described above requires the covariance matrix to be well-conditioned. For most GWASs, the ratio of the number of SNPs to the number of subjects is greater than or close to one, making the sample covariance matrix ill-conditioned32,33. In these cases, smaller eigenvalues of the sample covariance matrix are underestimated32, leading to inflated false positives in the gene-level association testing. To solve this issue, we adopt the POET estimator34, a principal orthogonal complement thresholding approach, to obtain a well-conditioned covariance matrix via sparse shrinkage under a high-dimensional setting. The estimator of $V = \{V_{jl}; j, l = 1, \ldots, K\}$ is defined as
$$\hat{V} = \sum_{j=1}^{H} \hat{\lambda}_j \hat{\eta}_j \hat{\eta}_j^T + \hat{\Omega},$$
where $\hat{\lambda}_j$ is the jth eigenvalue of the covariance matrix with corresponding eigenvector $\hat{\eta}_j$, $\hat{\Omega}$ is obtained by applying adaptive thresholding to $\sum_{j=H+1}^{K} \hat{\lambda}_j \hat{\eta}_j \hat{\eta}_j^T$, and H is the number of spiked eigenvalues. The degree of shrinkage is determined by a tuning parameter, and we choose one so that the positive definiteness of the estimated sparse covariance matrix is guaranteed. Notably, other sparse covariance matrix estimators32,33,35,36 can also be used in a similar fashion.
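The structure of this estimator (a low-rank part from the leading eigenpairs plus a thresholded residual) can be sketched as follows. This is a simplified illustration and not the POET implementation used by EPIC: it uses power iteration with deflation to obtain the leading eigenpairs, replaces adaptive thresholding with a single soft threshold `tau` on off-diagonal residual entries, and does not tune `tau` to guarantee positive definiteness. All function names and the toy matrix are assumptions.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenpairs(S, H, iters=500):
    """Leading H eigenpairs of a symmetric PSD matrix by power iteration with deflation."""
    n = len(S)
    A = [row[:] for row in S]
    pairs = []
    for _h in range(H):
        v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary start vector (assumption)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _it in range(iters):
            w = matvec(A, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(A, v)))
        pairs.append((lam, v))
        # Deflate: remove the component just found before searching for the next one.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

def soft(x, tau):
    """Soft-thresholding: shrink x toward zero by tau."""
    return math.copysign(max(abs(x) - tau, 0.0), x)

def poet_like(S, H, tau):
    """Low-rank part from H leading eigenpairs plus a residual whose
    off-diagonal entries are soft-thresholded; the diagonal is kept as is."""
    n = len(S)
    pairs = top_eigenpairs(S, H)
    low = [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(n)] for i in range(n)]
    V = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            resid = S[i][j] - low[i][j]
            V[i][j] = low[i][j] + (resid if i == j else soft(resid, tau))
    return V

S = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.1], [0.1, 0.1, 1.0]]  # toy LD matrix
V_hat = poet_like(S, H=1, tau=0.05)
for row in V_hat:
    print([round(x, 3) for x in row])
```

With `tau = 0` the sketch returns the input matrix unchanged, and larger values of `tau` remove more of the weak off-diagonal residual correlations while leaving the diagonal intact.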
Gene-level associations for rare variants
Recent advances in next-generation sequencing technology have made it possible to extend association testing to rare variants, which can explain additional disease risk or trait variability37-39. Previous work40 has demonstrated that the gene-level testing of rare variants is powerful and able to achieve well-controlled type I error as long as the correlation matrix of single-variant test statistics can be accurately estimated. Here, we recover the burden test statistics from GWAS summary statistics for the gene-level association testing of rare variants. Suppose that a total of M rare variants residing in a gene are genotyped. Let $U = \{U_j; j = 1, \ldots, M\}$ and $C = \{C_{jl}; j, l = 1, \ldots, M\}$ be the score statistic and the corresponding covariance matrix for testing the null hypothesis of no association. Under $H_0$, the burden test statistic $T = \xi^T U / \sqrt{\xi^T C \xi}$ follows a standard normal distribution, where $\xi_{M \times 1} = (1, \ldots, 1)^T$. We approximate $U_j$ and $C_{jl}$ by
$$w_j = \hat{\beta}_j / \hat{\sigma}_j^2 \quad \text{and} \quad \hat{C}_{jl} = R_{jl} / (\hat{\sigma}_j \hat{\sigma}_l),$$
where R is the correlation or covariance matrix of $z$ and $\hat{C} = \{\hat{C}_{jl}\}$ is an empirical approximation to $C$. Denote $w = (w_1, \ldots, w_M)^T$. The burden test uses
$$Q_{\mathrm{burden}} = \frac{(\xi^T w)^2}{\xi^T \hat{C} \xi},$$
which follows a chi-square distribution with one degree of freedom under the null $H_0$.
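A burden statistic of this kind can be recovered from summary statistics alone. The sketch below assumes the commonly used approximations $w_j = \hat{\beta}_j / \hat{\sigma}_j^2$ for the score statistics and $R_{jl} / (\hat{\sigma}_j \hat{\sigma}_l)$ for their covariance, with equal weights on all rare variants; the function name `burden_test` and the toy inputs are illustrative assumptions rather than EPIC's own code.

```python
import math

def burden_test(beta_hat, se, R):
    """Equal-weight burden test from per-variant effect sizes, standard errors,
    and the variant correlation matrix R. Returns (z, chi-square, p-value)."""
    M = len(beta_hat)
    w = [b / s ** 2 for b, s in zip(beta_hat, se)]                     # approximate scores
    C = [[R[j][l] / (se[j] * se[l]) for l in range(M)] for j in range(M)]  # approximate covariance
    num = sum(w)                          # xi^T w with xi = (1, ..., 1)
    den = sum(sum(row) for row in C)      # xi^T C xi
    z = num / math.sqrt(den)
    q = z * z
    p = math.erfc(math.sqrt(q / 2.0))     # upper tail of chi-square with 1 df
    return z, q, p

# Two independent rare variants (toy values).
z, q, p = burden_test([0.2, 0.4], [0.2, 0.2], [[1.0, 0.0], [0.0, 1.0]])
print(f"burden z = {z:.3f}, chi-square = {q:.3f}, p = {p:.4f}")
```

The p-value uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal, so its upper tail equals `erfc(sqrt(q / 2))`.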
Joint analysis for common and rare variants
Existing methods either remove rare variants from the analysis16,18 or do not differentiate common and rare variants when only summary statistics are available17. Yet, existing GWASs have successfully uncovered both common and rare variants associated with complex traits and diseases15,37-39, and rare variants should therefore not be ignored in the enrichment analysis. To incorporate rare variants into the common-variant gene association testing framework, we collapse genotypes of all rare variants within a gene to construct a pseudo-SNP. We then treat the aggregated pseudo-SNP as a common variant and concatenate the z-scores $z^* = (z_1, \ldots, z_K, z_{K+1})^T$, where the first K elements are from the common variants and $z_{K+1}$ is from the burden test statistic for the combined rare variants. A joint chi-square test for common and rare variants is performed as below:
$$Q^* = (z^*)^T (R^*)^{-1} z^* \sim \chi^2_{K+1},$$
where $R^*$ can be estimated using POET shrinkage with the pseudo-SNP included.
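As a worked illustration of the joint statistic, consider a hypothetical gene with K = 2 common SNPs and one rare-variant pseudo-SNP. All numbers below are made up, and the pseudo-SNP is assumed to be uncorrelated with the common SNPs purely to keep the arithmetic short.

```latex
\begin{aligned}
z^{*} &= (2,\; 1.2,\; 1.5)^{T}, \qquad
R^{*} = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \\[4pt]
Q^{*} &= (z^{*})^{T} (R^{*})^{-1} z^{*}
       = \frac{2^{2} - 2(0.5)(2)(1.2) + 1.2^{2}}{1 - 0.5^{2}} + 1.5^{2}
       = \frac{3.04}{0.75} + 2.25 \approx 6.30, \\[4pt]
\frac{Q^{*}}{K + 1} &\approx \frac{6.30}{3} \approx 2.10 .
\end{aligned}
```

Here $Q^*$ would be referred to a chi-square distribution with K + 1 = 3 degrees of freedom, and the last line shows the same statistic divided by its degrees of freedom.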
Gene-gene correlation
Proximal genes that share cis-SNPs inherit LD from SNPs, which results in correlations among genes. Since the correlations between genes are caused by LD between SNPs, which quickly drops off as a function of distance, we adopt a sliding-window approach to only compute correlations for pairs of genes within a certain distance from each other. It is worth noting that this also significantly reduces the computational burden. Specifically, let N be the number of genes from the same chromosome, and we adopt a sliding window of size d to estimate the sparse covariance matrix among genes $\{G_1, \ldots, G_d\}$, $\{G_2, \ldots, G_{d+1}\}$, …, $\{G_{N-d+1}, \ldots, G_N\}$, respectively. By default, we set d = 10 so that gene-wise correlations can be recovered for a gene with its 18 neighboring genes, nine on each side (see Supplementary Figure S1 for the effect of sliding window size on EPIC’s performance). Similar to MAGMA, correlations are only computed for pairs of genes within 5 megabases by default.
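The window enumeration can be sketched as below. The snippet lists the gene pairs that fall in at least one window of size d and lie within the distance limit; measuring distance between single gene coordinates is an assumption, as are the function names and the evenly spaced toy positions.

```python
def sliding_windows(n_genes, d=10):
    """Index lists for windows {G_1..G_d}, {G_2..G_{d+1}}, ..., {G_{N-d+1}..G_N}."""
    if n_genes <= d:
        return [list(range(n_genes))]
    return [list(range(start, start + d)) for start in range(n_genes - d + 1)]

def gene_pairs(positions, d=10, max_dist=5_000_000):
    """Pairs (i, j), i < j, of genes sharing a window and within max_dist base pairs.
    `positions` holds one coordinate per gene, sorted along the chromosome."""
    pairs = set()
    for window in sliding_windows(len(positions), d):
        for a in range(len(window)):
            for b in range(a + 1, len(window)):
                i, j = window[a], window[b]
                if abs(positions[j] - positions[i]) <= max_dist:
                    pairs.add((i, j))
    return pairs

def neighbors(pairs, gene):
    """All genes paired with `gene`."""
    return {j for i, j in pairs if i == gene} | {i for i, j in pairs if j == gene}

# 30 toy genes spaced 100 kb apart: an interior gene pairs with 9 genes on each side.
positions = [i * 100_000 for i in range(30)]
print(len(neighbors(gene_pairs(positions), 15)))
```

With genes spaced more widely, the 5-megabase limit rather than the window size determines how many neighbors a gene retains.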
Recall that the gene-level association statistics are chi-square statistics in a quadratic form. Within a specific window, the gene-wise correlations are obtained via transformations of the SNP-wise LD information. Let $z_s = (z_{s1}, \ldots, z_{sK})^T$ and $z_t = (z_{t1}, \ldots, z_{tK})^T$ be the SNP-wise z-scores for genes s and t, respectively. Let $R_s$, $R_t$, and $R_{st}$ be the within- and between-gene correlation matrices obtained from the POET shrinkage estimation. We take advantage of the Cholesky decomposition to obtain the gene-gene correlation between $Q_s = z_s^T R_s^{-1} z_s$ and $Q_t = z_t^T R_t^{-1} z_t$:
$$\rho_{st} = \mathrm{Corr}(Q_s, Q_t) = \frac{1}{K} \sum_{i=K+1}^{2K} \sum_{j=1}^{K} L_{ij}^2,$$
where $L_{ij}$’s are entries of a lower triangular matrix L such that
$$L L^T = \begin{pmatrix} I_K & R_s^{-1/2} R_{st} R_t^{-1/2} \\ R_t^{-1/2} R_{st}^T R_s^{-1/2} & I_K \end{pmatrix}$$
and $I_K$ is the identity matrix with dimension K. The full derivation is detailed in Supplementary Note S1. When rare variants are included in the framework, gene-gene correlations are calculated similarly by aggregating all rare variants that reside in a gene as a pseudo-SNP.
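One way to compute such a correlation numerically is sketched below. It rests on a standard result for jointly normal z-scores under the null: after whitening each gene's z-scores, the correlation between the two quadratic forms equals the squared Frobenius norm of the whitened cross-correlation block divided by $\sqrt{K_s K_t}$. This is our own illustrative derivation under that assumption, not a transcription of Supplementary Note S1, and the function names and toy matrices are hypothetical.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a (symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower-triangular L."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def gene_gene_corr(R_s, R_t, R_st):
    """Correlation between z_s^T R_s^{-1} z_s and z_t^T R_t^{-1} z_t under the null.
    R_st[i][j] is the correlation between SNP i of gene s and SNP j of gene t."""
    L_s, L_t = cholesky(R_s), cholesky(R_t)
    K_s, K_t = len(R_s), len(R_t)
    # M = L_s^{-1} R_st, solved one column at a time.
    cols = [forward_solve(L_s, [R_st[i][j] for i in range(K_s)]) for j in range(K_t)]
    M = [[cols[j][i] for j in range(K_t)] for i in range(K_s)]
    # Rows of B = M L_t^{-T} are obtained by solving L_t b = (row of M).
    B = [forward_solve(L_t, M[i]) for i in range(K_s)]
    frob_sq = sum(v * v for row in B for v in row)
    return frob_sq / math.sqrt(K_s * K_t)

# Two single-SNP genes whose SNPs have LD r = 0.6: correlation of the statistics is r^2.
print(gene_gene_corr([[1.0]], [[1.0]], [[0.6]]))
```

Two sanity checks follow directly from the formula: genes with no cross-gene LD give a correlation of zero, and a gene compared with itself gives a correlation of one.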
Prioritizing trait-relevant tissue(s) and cell type(s)
To detect tissue- or cell-type-specific enrichment for a specific trait of interest, we devise a regression framework based on generalized least squares to identify risk loci enrichment. The key underlying hypothesis is that if a particular cell type influences a trait, more GWAS polygenic signals would be concentrated in genes with greater cell-type-specific gene expression. Under this hypothesis, genes that are significantly associated with lipid traits are expected to be highly expressed in the liver since the liver is known to participate in cholesterol regulation. This relationship between the GWAS association signals and the gene expression specificity is modeled as below.
Let $Q_g$ be the gene-level chi-square association test statistic for gene g. To account for the different number of SNPs within each gene, we adjust for the degrees of freedom $K_g + 1$ to obtain $Y_g = Q_g / (K_g + 1)$, which is included as the outcome variable. For each cell type c, to test for its enrichment we fit a separate regression using its cell-type-specific gene expression $E_{cg}$ (reads per kilobase million (RPKM) or transcripts per million (TPM)) as an independent variable. To account for the baseline gene expression24, we also include another covariate $A_g = \frac{1}{T} \sum_{c=1}^{T} E_{cg}$, which is the average gene expression across all T tissues/cell types. Taken together, we have
$$Y_g = \gamma_0 + E_{cg} \gamma_c + A_g \gamma_A + \epsilon_g,$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_N)^T \sim \mathrm{MVN}(0, \sigma^2 W)$, $W = D P D^T$, $D = \mathrm{diag}\{\sqrt{2 / (K_g + 1)}\}$, and $P = \{\rho_{st}\}$ is the gene-gene correlation matrix. We adopt the generalized least squares approach to fit the model and perform a one-sided test against the alternative $\gamma_c > 0$, under which the gene-level association signals are positively correlated with the cell-type-specific expression. For a significantly enriched tissue or cell type, we further carry out a statistical influence test to identify a set of tissue- or cell-type-specific influential genes, using the DFBETAS statistics41; large values of DFBETAS indicate observations (i.e., genes) that are influential in estimating $\gamma_c$. Genes are declared influential using a size-adjusted cutoff of $2 / \sqrt{N}$, where N is the number of genes used in the tissue- or cell-type-specific enrichment analysis, and the resulting set of significantly influential genes allows for further pathway or gene set enrichment analyses.
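The regression and influence steps can be sketched on toy data as follows. The sketch whitens the outcome and covariates with a Cholesky factor of W, fits ordinary least squares on the whitened data (which is equivalent to generalized least squares), and computes DFBETAS for the expression coefficient by brute-force leave-one-out refits. Several choices here are assumptions for illustration: the one-sided p-value uses a normal approximation to the t-statistic, influence is computed on the whitened observations, W is a toy banded matrix, and none of the function names come from the EPIC package.

```python
import math

def cholesky(a):
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def back_solve_t(L, b):
    """Solve L^T x = b for lower-triangular L."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def spd_solve(A, b):
    L = cholesky(A)
    return back_solve_t(L, forward_solve(L, b))

def ols(X, y):
    """Coefficients, residual variance, and the diagonal of (X^T X)^{-1}."""
    n, p = len(X), len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    coef = spd_solve(XtX, Xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2 for r, yi in zip(X, y))
    inv_diag = [spd_solve(XtX, [float(k == a) for k in range(p)])[a] for a in range(p)]
    return coef, rss / (n - p), inv_diag

def whiten(X, y, W):
    """Return L^{-1} X and L^{-1} y, where W = L L^T."""
    L = cholesky(W)
    p = len(X[0])
    cols = [forward_solve(L, [r[a] for r in X]) for a in range(p)]
    Xw = [[cols[a][i] for a in range(p)] for i in range(len(y))]
    return Xw, forward_solve(L, y)

def dfbetas(Xw, yw, j):
    """DFBETAS of coefficient j for each observation, by leave-one-out refits."""
    coef, _, inv_diag = ols(Xw, yw)
    out = []
    for i in range(len(yw)):
        ci, s2i, _ = ols(Xw[:i] + Xw[i + 1:], yw[:i] + yw[i + 1:])
        out.append((coef[j] - ci[j]) / math.sqrt(s2i * inv_diag[j]))
    return out

# Toy data: 8 genes; the last gene has high expression and an inflated association signal.
E = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0]   # cell-type-specific expression
A = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]    # average expression across cell types
noise = [0.05, 0.05, -0.05, -0.05, 0.05, 0.05, -0.05, -0.05]
Y = [1.0 + 0.1 * e + 0.2 * a + n for e, a, n in zip(E, A, noise)]
Y[7] += 5.0
X = [[1.0, e, a] for e, a in zip(E, A)]
N = len(Y)
W = [[1.0 if i == j else (0.3 if abs(i - j) == 1 else 0.0) for j in range(N)] for i in range(N)]

Xw, yw = whiten(X, Y, W)
coef, s2, inv_diag = ols(Xw, yw)
t_c = coef[1] / math.sqrt(s2 * inv_diag[1])
p_one_sided = 0.5 * math.erfc(t_c / math.sqrt(2.0))  # normal approximation (assumption)
infl = dfbetas(Xw, yw, 1)
cutoff = 2.0 / math.sqrt(N)
influential = [g for g, v in enumerate(infl) if abs(v) > cutoff]
print(f"gamma_c = {coef[1]:.3f}, one-sided p = {p_one_sided:.3g}, influential genes = {influential}")
```

The normal approximation is crude for eight observations but is reasonable for the thousands of genes in a real analysis; an exact one-sided test would use a t distribution with N minus 3 degrees of freedom.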
GWAS summary statistics and transcriptomic data processing
We adopt GWAS summary statistics of eight traits, including four lipid traits42 (low-density lipoprotein cholesterol (LDL), high-density lipoprotein cholesterol (HDL), total cholesterol (TC), and triglyceride levels (TG)), three neuropsychiatric disorders39,43,44 (schizophrenia (SCZ), bipolar disorder (BIP), and schizophrenia and bipolar disorder (SCZBIP)), and type 2 diabetes38 (T2D). The relevant tissues involved in these traits are well studied – liver for the lipid traits, brain for the neuropsychiatric disorders, and pancreas for T2D – and we use these as ground truths to demonstrate EPIC and to benchmark against other methods. See Supplementary Table S1 for more information on the GWASs.
For each trait, we obtain SNP-level summary statistics and apply stringent quality control procedures to the data. We restrict our analyses to autosomes, filter out SNPs not in the 1000 Genomes Project Phase 3 reference panel, and remove SNPs with mismatched reference SNP ID numbers. We exclude SNPs from the major histocompatibility complex (MHC) region due to complex LD architecture16,19,22. In addition to SNP filtering, we align alleles of each SNP against those of the reference panel to harmonize the effect alleles of all processed GWAS summary statistics. A gene window is defined with 10kb upstream and 1.5kb downstream of each gene14, and SNPs residing in the windows are assigned to the corresponding genes.
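The SNP-to-gene assignment step can be sketched as below. The window sizes follow the text (10 kb upstream, 1.5 kb downstream), while treating upstream and downstream as strand-aware is an assumption, as are the function names, tuple layouts, and toy coordinates.

```python
def gene_window(start, end, strand, upstream=10_000, downstream=1_500):
    """Genomic interval covering the gene plus its flanks (strand-aware, an assumption)."""
    if strand == "+":
        return start - upstream, end + downstream
    return start - downstream, end + upstream

def assign_snps(snps, genes):
    """Map each gene to the SNPs falling in its window.
    snps:  list of (snp_id, chrom, pos)
    genes: list of (gene_id, chrom, start, end, strand)"""
    assigned = {gene[0]: [] for gene in genes}
    for gene_id, chrom, start, end, strand in genes:
        lo, hi = gene_window(start, end, strand)
        for snp_id, snp_chrom, pos in snps:
            if snp_chrom == chrom and lo <= pos <= hi:
                assigned[gene_id].append(snp_id)
    return assigned

genes = [("G1", "1", 100_000, 120_000, "+"), ("G2", "1", 200_000, 220_000, "-")]
snps = [("rs1", "1", 91_000), ("rs2", "1", 89_000), ("rs3", "1", 121_000),
        ("rs4", "1", 122_000), ("rs5", "1", 229_000), ("rs6", "1", 198_000),
        ("rs7", "2", 100_500)]
print(assign_snps(snps, genes))
```

A SNP can be assigned to more than one gene when windows overlap, which is one source of the gene-gene correlation handled later in the framework.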
In the analysis that follows, we uniformly report results using a minor allele frequency (MAF) cutoff of 1% to define common and rare variants (see Supplementary Figure S2 for enrichment results with different MAF cutoffs). To reduce the computational cost and to alleviate the multicollinearity problem, we perform LD pruning using PLINK45 with a threshold of r2 ≤ 0.8 to obtain a set of pruned-in common variants, followed by a second round of LD pruning if the number of common SNPs per gene exceeds 200. See Supplementary Figure S3 for results with varying LD-pruning thresholds. For rare variants, we only carry out gene-level rare-variant association testing if the minor allele count (MAC), defined as the total number of minor alleles across subjects and SNPs within the gene, exceeds 20. We report the number of SNPs (common variants and rare variants), the number of genes, and the number of SNPs per gene for each GWAS trait in Supplementary Table S2.
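The common/rare split and the MAC filter can be illustrated as follows. Because only summary statistics are available, the sketch approximates the gene-level MAC as the sum of 2 × (sample size) × MAF over the rare variants in the gene, which is an assumption; treating a MAF of exactly 1% as common is also an assumption, and the function names are hypothetical.

```python
def split_by_maf(snps, maf_cutoff=0.01):
    """Split SNP records (dicts with a 'maf' key) into common and rare variants."""
    common = [s for s in snps if s["maf"] >= maf_cutoff]
    rare = [s for s in snps if s["maf"] < maf_cutoff]
    return common, rare

def gene_mac(rare_snps, n_subjects):
    """Approximate minor allele count across subjects and rare SNPs in a gene."""
    return sum(2 * n_subjects * s["maf"] for s in rare_snps)

def run_rare_test(rare_snps, n_subjects, mac_min=20):
    """Only test rare variants in a gene when the gene-level MAC exceeds mac_min."""
    return gene_mac(rare_snps, n_subjects) > mac_min

snps_in_gene = [{"id": "rs1", "maf": 0.25}, {"id": "rs2", "maf": 0.004}, {"id": "rs3", "maf": 0.005}]
common, rare = split_by_maf(snps_in_gene)
print(len(common), len(rare), run_rare_test(rare, n_subjects=1000))
```

In this toy gene the approximate MAC is 18 with 1,000 subjects, so the rare-variant test would be skipped; doubling the sample size would bring it above the threshold.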
We adopt a unified framework to process all transcriptomic data. For scRNA-seq data, we follow the Seurat46 pipeline to perform gene- and cell-wise quality controls and focus on the top 8000 highly variable genes. Cell-type-specific RPKMs are calculated by combining read or UMI counts from all cells of a specific cell type, followed by log2 transformation with an added pseudo-count. For tissue-specific bulk RNA-seq data from the Genotype-Tissue Expression project (GTEx), we first calculate a tissue specificity score for each gene19,47, and only focus on genes that are highly specific in at least one tissue. See Supplementary Note S2 for more details. We then perform log2 transformation on the tissue-specific TPM measurements with an added pseudo-count.
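The pooling step for single-cell data can be sketched as below: counts from all cells of a cell type are summed, converted to RPKM, and log2-transformed with a pseudo-count. The pseudo-count of 1, the dictionary-based data layout, the function name, and the toy counts and gene lengths are all assumptions for illustration.

```python
import math

def celltype_log_rpkm(counts, cell_types, gene_lengths_bp, pseudo_count=1.0):
    """counts[cell][gene] -> {cell_type: {gene: log2(RPKM + pseudo_count)}}.
    RPKM = pooled count / (gene length in kb) / (pooled library size in millions)."""
    pooled = {}
    for cell, gene_counts in counts.items():
        bucket = pooled.setdefault(cell_types[cell], {})
        for gene, c in gene_counts.items():
            bucket[gene] = bucket.get(gene, 0) + c
    out = {}
    for cell_type, gene_counts in pooled.items():
        total_millions = sum(gene_counts.values()) / 1e6
        out[cell_type] = {
            gene: math.log2(c / (gene_lengths_bp[gene] / 1e3) / total_millions + pseudo_count)
            for gene, c in gene_counts.items()
        }
    return out

counts = {"cell1": {"geneA": 10, "geneB": 40}, "cell2": {"geneA": 30, "geneB": 20}}
cell_types = {"cell1": "beta", "cell2": "beta"}
gene_lengths_bp = {"geneA": 2000, "geneB": 1000}
print(celltype_log_rpkm(counts, cell_types, gene_lengths_bp))
```

Pooling before normalization weights each cell by its sequencing depth, which differs from averaging per-cell normalized values.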
Benchmarking against RolyPoly, LDSC-SEG, and MAGMA
We benchmarked EPIC against three existing approaches: RolyPoly16, LDSC-SEG18, and MAGMA17. For all methods, we used RPKMs for each cell type and TPMs for each GTEx tissue in the benchmarking analysis. We made gene annotations the same for RolyPoly, MAGMA, and EPIC by defining the gene window as 10kb upstream and 1.5kb downstream of each gene. For LDSC-SEG, as recommended by the authors18, the window size is set to be 100kb up and downstream of each gene’s transcribed region. Since all methods adopt a hypothesis testing framework to identify trait-relevant tissue(s), for each trait-tissue pair, we reported and compared the corresponding p-values from the different methods.
RolyPoly takes as input GWAS summary statistics, gene expression data, gene annotations, and LD matrix from the 1000 Genomes Project Phase 3. As recommended by the developer for RolyPoly16, we scaled the gene expression for each gene across tissues/cell types and took the absolute values of the scaled expression values. We performed 100 block bootstrapping iterations to test whether a tissue- or cell-type-specific gene expression annotation was significantly enriched in a joint model across all tissues or cell types. We also benchmarked LDSC-SEG, which computes t-statistics to quantify differential expression for each gene across tissues or cell types. We annotated genome-wide SNPs using the top 10% genes with the highest positive t-statistics and applied stratified LDSC to test the heritability enrichment of the annotations that were attributed to specifically expressed genes for each tissue. For MAGMA, we first obtained gene-level association statistics using MAGMA v1.08. We then carried out the gene-property analysis proposed in Watanabe et al.24, with technical confounders being controlled by default, to test the positive relationship between tissue- or cell-type specificity of gene expression and genetic associations.
Results
Inferring trait-relevant tissues using bulk RNA-seq from GTEx
We started our analysis with tissue-specific transcriptomic profiles from the GTEx v8 dataset6, which consists of bulk-tissue gene expression measurements of 17,382 samples from 54 tissues across 980 postmortem donors (Supplementary Table S1). Tissues with fewer than 100 samples were removed from the analysis. After sample-specific quality controls, we obtained gene expression profiles of 45 tissues, averaged across samples. For subsequent analyses, we focused on a set of 8,708 genes with tissue specificity scores greater than 5. We applied EPIC to the GTEx data with GWAS summary statistics for eight diseases and traits, including four lipid traits, three neuropsychiatric disorders, and T2D.
We first performed the gene-level chi-square association test with the shrinkage estimators and sliding-window approach. The quantile-quantile (Q-Q) plots of gene-level p-values are shown in Supplementary Figure S4, with a comparison against MAGMA. We observed elevated power in the Q-Q plots for four lipid traits. In Supplementary Table S3, we summarized a list of genes that have been shown to modulate lipid levels42 and compared the gene-level association testing results from EPIC and MAGMA. Significant gene-level associations were detected between all lipid traits and variants in APOB, APOE, and CETP. Meanwhile, PCSK9, ABCG5, and ABCG8 exhibited significant associations with LDL and TC. For neuropsychiatric disorders, we examined genes that are relevant to the etiology of schizophrenia39, including genes that are targets of therapeutic drugs (DRD2 and GRM3), genes that participate in neuronal calcium signaling (CACNA1I), and genes that are involved in synaptic function (CNTN4 and SNAP91) and other neuronal pathways (FXR1, CHRNA3, CHRNB4, and HCN1). EPIC’s chi-square test approach demonstrates higher power than MAGMA. We also compared the number of significant genes for eight traits – after Bonferroni correction, EPIC detected more genes than MAGMA (Supplementary Figure S5). Additionally, we report gene-level association tests for a set of housekeeping genes48 and demonstrate that, while powerful, EPIC also controls for type I error (Supplementary Figure S6).
We next applied EPIC to identify the trait-relevant tissues by performing tissue-specific regression for each trait, with results shown in Figure 2, Figure 3A, and Figure 4A. All four lipid traits are significantly enriched in the liver, which plays a key role in lipid metabolism. Specifically, LDL, TC, and TG showed strong enrichment in the liver (Figure 2A, Figure 2C, and Figure 2D), suggesting that these three traits are embedded in a similar genetic architecture and share the same relevant tissue. The small intestine was marginally significant for TC – it has been shown that the small intestine plays an important role in cholesterol regulation and metabolism49-51. On the other hand, HDL exhibited a slightly different enrichment pattern (Figure 2B): liver and two adipose tissues are identified as being significantly enriched by both EPIC and MAGMA. Both LDSC-SEG and RolyPoly suffer from low power, although the liver was one of the top-ranked tissues for the lipid traits.
Neuropsychiatric disorders exhibited strong brain-specific enrichments, as expected. The frontal cortex of the brain was detected as being the most strongly enriched for SCZ, BIP, and SCZBIP (Figure 4A). The pituitary also demonstrated strong enrichment signals with SCZ and SCZBIP, while the spinal cord was found to be irrelevant to these three neuropsychiatric disorders. In comparison, LDSC-SEG identified only some of the brain tissues as trait-relevant, while RolyPoly failed to return enrichment in any of the brain tissues (Figure 4A).
As a final proof of concept, we sought to infer T2D-relevant tissue(s) using the tissue-specific gene expression data from GTEx. The pancreas and the liver were prioritized as the T2D-relevant tissues by EPIC, while MAGMA yielded significant results in the pancreas as well as the stomach (Figure 3A). RolyPoly identified the pancreas as the second most relevant tissue; LDSC-SEG reported liver as the only significantly enriched tissue (Figure 3A). For validation, we adopted a strategy similar to that proposed by Shang et al.19 – we carried out a PubMed search of the previous literature studying the trait of interest in relation to a particular tissue or cell type. Specifically, we counted the number of previous publications using the keyword pairs of trait and tissue/cell type and calculated the Spearman’s rank correlations between the number of publications and EPIC’s tissue-/cell-type-specific p-values (Figure 5). Across all traits, we found strong positive correlations between EPIC’s enrichment results and PubMed search results (Figure 5A).
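The validation statistic can be sketched as a Spearman rank correlation computed from average ranks. Converting p-values to −log10(p), so that larger values mean stronger enrichment and a positive correlation indicates agreement, is an assumption about the sign convention; the publication counts and p-values below are invented for illustration, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical publication counts and enrichment p-values for four tissues.
n_publications = [120, 40, 5, 1]
p_values = [1e-6, 1e-3, 0.2, 0.5]
enrichment = [-math.log10(p) for p in p_values]
print(spearman(n_publications, enrichment))
```

Because the statistic depends only on ranks, it is unaffected by the skewed distribution of publication counts across tissues.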
Cell-type enrichment for T2D by scRNA-seq data of pancreatic islets
We next analyzed pancreatic islet scRNA-seq data to identify trait-relevant cell types for T2D. To assess reproducibility, EPIC was separately applied to two scRNA-seq datasets consisting of multiple endocrine cell types (Supplementary Table S1 and Supplementary Figure S7). The scRNA-seq data were generated using two different protocols: the Smart-seq2 protocol on six healthy donors from Segerstolpe et al.9 and the InDrop protocol on three healthy individuals from Baron et al.7. Following the pre-processing step as described in Materials and Methods, we retained a total of 5,488 genes to prioritize pancreatic cell types for T2D. In both datasets, beta cells were identified as the trait-relevant cell types by EPIC (Figure 3B). This finding was supported by known biology, in that beta cells participate in insulin secretion and are gradually lost in T2D52-54. We also found that gamma cells were marginally associated with T2D in the Segerstolpe dataset. Pancreatic polypeptide, which is produced by gamma cells, is known to play a critical role in endocrine pancreatic secretion regulation55-57. However, neither MAGMA nor LDSC-SEG detected significant enrichment in beta cells, even though the enrichment was top-ranked. RolyPoly, on the other hand, did not report any enrichment of the beta cells compared to the other types of cells.
To identify specific genes that drive the significant enrichment in beta cells, we carried out the gene-specific influence test as outlined in Materials and Methods and identified 142 highly influential genes (Figure 3C). We then performed KEGG pathway analysis and Gene Ontology (GO) biological process enrichment analysis using the DAVID bioinformatics resources58,59. Beta-cell-specific influential genes are enriched in GO terms including glucose homeostasis and regulation of insulin secretion, as well as KEGG pathways including insulin secretion, maturity-onset diabetes of the young, and type II diabetes mellitus (Figure 3C and Supplementary Table S4). Additionally, the cell-type ranks obtained from EPIC’s beta-cell-specific p-values were highly consistent with those from the PubMed search results (Figure 5B). These results demonstrate the effectiveness of EPIC in identifying trait-relevant cell types using scRNA-seq datasets generated by different protocols.
Cell-type enrichment for neuropsychiatric disorders by scRNA-seq data of brain
To further test EPIC in a more complex tissue, we sought to prioritize trait-relevant cell types in the brain. While the brain tissues are significantly enriched using the GTEx bulk-tissue RNA-seq data (Figure 4A), the relevant cell types in the brain for neuropsychiatric disorders are not as well defined and studied. We obtained droplet-based scRNA-seq data8, generated on frozen adult human postmortem tissues from the GTEx project (Supplementary Table S1), to infer the relevant cell types. After pre-processing and stringent quality controls, the scRNA-seq data contains gene expression profiles of 17,698 genes across 14,137 single cells collected from the human hippocampus and prefrontal cortex tissues. The cells belong to ten cell types (Figure 4B), and we focused on the top 8,000 highly variable genes for subsequent analyses.
We evaluated EPIC’s cell-type-specific enrichment results and found that all three neuropsychiatric disorders were significantly enriched in GABAergic interneurons (GABA), excitatory glutamatergic neurons from the prefrontal cortex (exPFC), and excitatory pyramidal neurons in the hippocampal CA region (exCA). Excitatory granule neurons from the hippocampal dentate gyrus region (exDG) were identified as relevant cell types for SCZ and SCZBIP (Figure 4C). EPIC successfully replicated the previously reported association of neuropsychiatric disorders with interneurons and excitatory pyramidal neurons14,15.
We employed three strategies to validate the trait-relevant cell types for the neuropsychiatric disorders. First, we again found positive Spearman correlations between PubMed search results and EPIC’s enrichment results for SCZ and SCZBIP (Figure 5C). Second, we adopted additional independent GWAS summary statistics for SCZ (SCZ2)60 (Supplementary Table S1) and observed highly concordant enrichment results between SCZ and SCZ2 (Figure 4C). Third, we tested whether genes that are upregulated/downregulated for SCZ were enriched in the identified cell types to additionally implicate cell types involved in SCZ. Specifically, we performed differential expression (DE) analysis from an independent case-control study of SCZ using bulk RNA-seq61, retaining 287 significant DE genes that also overlap the scRNA-seq data (Supplementary Figure S8). We reasoned that, if SCZ-relevant risk loci were enriched in a particular cell type, genes that are differentially expressed between SCZ cases and controls would demonstrate greater cell-type specificity in this cell type. We calculated cell-type specificities using the set of DE genes and observed that GABA, exCA, exDG, and exPFC were the top four cell types with the lowest gene-specificity ranks (Figure 4D). Using these three strategies, which query external databases and adopt additional and orthogonal datasets, we validated the trait-cell-type relevance results.
Discussion
Over the last decade and a half, GWASs have successfully identified and replicated genetic variants associated with various complex traits. Meanwhile, bulk-tissue and single-cell transcriptomic sequencing allow tissue- and cell-type-specific gene expression characterization and have seen rapid technological development with ever-increasing sequencing capacities and throughputs. Here, we propose EPIC to address the problem of how GWAS summary statistics should be integrated with bulk-tissue or single-cell transcriptomic data to prioritize trait-relevant tissues or cell types and to elucidate disease etiology. To the best of our knowledge, EPIC is the first method that prioritizes tissues and/or cell types for both common and rare variants with a rigorous statistical framework to account for both within- and between-gene correlations. We demonstrate that EPIC is effective and outperforms existing methods through extensive benchmark and validation studies.
For scRNA-seq data, all existing methods, including EPIC, resort to pre-clustered/annotated cell types and average across cells to obtain cell-type-specific expression profiles. However, scRNA-seq goes beyond the mean measurements62,63, and how to make the best use of gene expression dispersion, nonzero fraction, and other aspects of its distribution needs further method development64. Additionally, while many efforts have been devoted to identifying enrichment of discretized cell types, how to carry out enrichment analysis for transient cell states needs further investigation. Last but not least, when multiple scRNA-seq datasets are available across different experiments, protocols, or species, borrowing information from additional sources can potentially boost the performance and increase the robustness of the enrichment analysis65. While it is nontrivial to directly perform gene expression data integration, a cross-dataset conditional analysis workflow was proposed by Watanabe et al.24 to evaluate the association of cell types based on multiple independent scRNA-seq datasets. However, the linear conditional analysis may not be sufficient to capture the nonlinear batch effects46,66.
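To make the limitation of mean-only profiles concrete, the sketch below summarizes one gene within each annotated cell type by its mean, its nonzero fraction, and a simple dispersion measure. The Fano factor (variance divided by mean) is used here only as an illustrative stand-in for dispersion, and the cell-type labels and expression values are hypothetical.

```python
from statistics import mean, pvariance

def summarize_by_cell_type(values, labels):
    """Per-cell-type summaries of one gene's expression across cells."""
    groups = {}
    for x, ct in zip(values, labels):
        groups.setdefault(ct, []).append(x)
    summary = {}
    for ct, xs in groups.items():
        m = mean(xs)
        summary[ct] = {
            "mean": m,
            # fraction of cells in which the gene is detected at all
            "nonzero_fraction": sum(1 for x in xs if x > 0) / len(xs),
            # illustrative dispersion measure: variance-to-mean ratio
            "fano": pvariance(xs) / m if m > 0 else float("nan"),
        }
    return summary

# Toy data: two hypothetical cell types with identical mean expression.
values = [0.0, 0.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0]
labels = ["typeA"] * 4 + ["typeB"] * 4

summary = summarize_by_cell_type(values, labels)
```

In this toy case both cell types have the same mean, so an averaged profile cannot tell them apart, while the nonzero fraction and the dispersion differ.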
It is also worth noting that CoCoNet, MAGMA, and EPIC first carry out a gene-level association test, so that the summary statistics and expression measurements are unified at the gene level. They adopt different methods to integrate SNP-wise summary statistics, and SNPs need to be annotated to genes based on a window surrounding each gene. While RolyPoly and LDSC-SEG model at the SNP level directly, each SNP still needs to be assigned to a gene so that the gene expression can be used as a SNP annotation. There is no consensus on how to most accurately assign SNPs to genes, and more importantly, one would only be able to do so for SNPs that reside in gene bodies or promoter regions. Meanwhile, a large number of GWAS hits are in non-coding regions, and their functions are yet to be fully understood. EPIC’s framework can be easily extended to infer enrichment of non-coding variants when combined with single-cell assay for transposase-accessible chromatin using sequencing (ATAC-seq) data67,68. Additionally, cell-type-specific expression quantitative trait loci from the non-coding regions69 can also be integrated with the second-step gene-property analysis to boost power and to infer enrichment of non-coding variants.
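Window-based SNP-to-gene annotation can be sketched as follows. This is a generic illustration rather than the annotation procedure of any of the methods named above: the symmetric 10 kb window, the identifiers, and the coordinates are all hypothetical, and strand is ignored for simplicity.

```python
def annotate_snps_to_genes(snps, genes, window=10_000):
    """Assign each SNP to every gene whose padded interval contains it.

    snps:  list of (snp_id, chrom, pos)
    genes: list of (gene_id, chrom, start, end)
    Returns (gene -> list of SNP ids, list of unassigned SNP ids).
    """
    gene_to_snps = {gene_id: [] for gene_id, _, _, _ in genes}
    unassigned = []
    for snp_id, chrom, pos in snps:
        hit = False
        for gene_id, g_chrom, start, end in genes:
            if chrom == g_chrom and start - window <= pos <= end + window:
                gene_to_snps[gene_id].append(snp_id)
                hit = True
        if not hit:
            unassigned.append(snp_id)  # e.g. distal non-coding variants
    return gene_to_snps, unassigned

# Hypothetical coordinates for illustration only.
genes = [("G1", "1", 100_000, 120_000), ("G2", "1", 125_000, 140_000)]
snps = [
    ("snp1", "1", 95_000),   # within the padded window of G1
    ("snp2", "1", 122_000),  # between G1 and G2, inside both windows
    ("snp3", "1", 500_000),  # far from any gene
    ("snp4", "2", 110_000),  # different chromosome
]

gene_to_snps, unassigned = annotate_snps_to_genes(snps, genes)
```

The toy output shows the two difficulties raised in the text: a SNP between neighboring genes can be assigned to more than one gene, and SNPs far from any gene are left unannotated.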
Data Availability
GWAS summary statistics are downloaded from public repositories listed in Supplementary Table S1. Genotypes from the 1000 Genomes Project reference panel are available at https://ctg.cncr.nl/software/magma. Bulk RNA-seq and scRNA-seq data are downloaded from GTEx v8 at http://www.gtexportal.org/. ScRNA-seq read counts from two pancreatic islet studies are publicly available with accession GSE814337 and E-MTAB-50619. We obtain a list of human housekeeping genes from the Housekeeping and Reference Transcript Atlas48 at https://housekeeping.unicamp.br/.
Code Availability
EPIC is compiled as an open-source R package available at https://github.com/rujinwang/EPIC.
Authors’ Contributions
Y.J. and D.L. initiated and envisioned the study. R.W., Y.J., and D.L. formulated the model; R.W. developed and implemented the algorithm. R.W., D.L., and Y.J. performed data analysis. R.W. and Y.J. wrote the manuscript, which was edited by D.L.
Competing Interests
The authors declare no competing interests.
Acknowledgments
This work was supported by the National Institutes of Health (NIH) grants P01 CA142538 (to D.L. and Y.J.), R35 GM138342 (to Y.J.), and R01 HG009974 (to D.L.). The authors thank Drs. Yun Li, Michael Love, Karen Mohlke, and Jason Stein for helpful discussions and comments, and Drs. Alkes Price, Diego Calderon, and Kyoko Watanabe for providing support and insight on existing methods.