Abstract
Thousands of genetic variants acting in multiple cell types underlie complex disorders, yet most gene expression studies profile only bulk tissues, making it hard to resolve where genetic and non-genetic contributors act. This is particularly important for psychiatric and neurodegenerative disorders that impact multiple brain cell types with highly-distinct gene expression patterns and proportions. To address this challenge, we develop a new framework, SPLITR, that integrates single-nucleus and bulk RNA-seq data, enabling phenotype-aware deconvolution and correcting for systematic discrepancies between bulk and single-cell data. We deconvolved 3,387 post-mortem brain samples across 1,127 individuals and in multiple brain regions. We find that cell proportion varies across brain regions, individuals, disease status, and genotype, including genetic variants in TMEM106B that impact inhibitory neuron fraction and 4,757 cell-type-specific eQTLs. Our results demonstrate the power of jointly analyzing bulk and single-cell RNA-seq to provide insights into cell-type-specific mechanisms for complex brain disorders.
Introduction
The progression of most neurodegenerative and neuropsychiatric disorders, including Alzheimer’s disease (AD), commonly disrupts a broad spectrum of regulatory networks at the genomic and epigenomic levels, which poses a significant challenge to elucidating the mechanisms underlying disease progression. Genome-wide association studies (GWAS) show that AD is highly heritable, and that most of the heritability is explained by common genetic variants. However, AD is also highly polygenic, involving potentially hundreds of independent regulatory mechanisms.1–3 From the transcriptomic and epigenomic profiling of postmortem samples across different brain regions, we discover the target genes and regulatory elements perturbed by the disease progression and gain insights into the mechanisms of the genetic variants through the regulatory networks.3–5 However, many different factors contribute to the variability of the existing postmortem brain data, and it is crucially important to identify and classify these factors by information source.
A large part of the complexity in mental health traits stems from cell-type heterogeneity in brain tissues.6 Transcriptomic and epigenomic profiling at single-cell resolution provides a principled tool with which to investigate the relationship between changes in cell types and AD pathology. In our recent study, using single-cell RNA-seq (scRNA-seq) data across 80,660 nuclei isolated from post-mortem brain samples across 48 individuals, we discovered seven major types and 40 subtypes of brain cells. We used these data to recognize cell-type-specific alterations associated with diverse pathological variables including age, sex, and AD pathology.6 However, single-cell RNA-seq profiling is typically optimized to yield a large number of cells from a small number of individuals. Most single-cell studies, including our published scRNA-seq analysis,6 involve at most tens of individuals:7 too few to measure correlations with other population-level variables such as genotype information. At the bulk tissue level, by contrast, RNA-seq studies routinely profile hundreds of individuals. For example, the Religious Orders Study and Memory and Aging Project (ROSMAP) has profiled bulk RNA-seq from more than 400 individuals with matched genotype information3,8, and the Genotype-Tissue Expression consortium (GTEx) has profiled more than 2,500 brain samples across 13 brain regions from genotyped individuals.9
Here we develop a novel integrative framework to uncover cell-type-specific alterations of bulk samples, by combining both single-nucleus and bulk RNA-seq data with computational deconvolution followed by comprehensive association analysis. We develop a highly accurate deconvolution method which takes into account individual-level heterogeneity present in both single-nucleus and bulk data. We also directly address systematic discrepancies between single-nucleus and bulk data by characterizing substantial technical inconsistencies between them and developing a transformation approach to overcome them. We apply this method systematically across 3,387 samples to study the variation of neural cell-types across brain regions and their association with other variables measured in the bulk data. We then interrogate the mechanisms at the resolution of pathways and genetic regulatory networks by deconvolving the tissue-level eQTL models into cell-type-specific models.
Results
SPLITR deconvolution accounting for biological covariates and bulk-vs-single-cell differences
Existing deconvolution methods10,11 estimate cell-type fractions from bulk RNA-seq data by making the explicit or implicit assumption that bulk RNA-seq data should match the sum of the same set of scRNA-seq data across the different cell types. In practice, however, aggregated scRNA-seq data and bulk RNA-seq data show substantial discrepancies, even for the most established marker genes. One reason for these discrepancies is that single-cell data and bulk data have highly distinct biases due to gene length, mRNA subcellular localization, transcriptional burstiness, mRNA stability, and the cell-to-cell variability of each gene’s expression patterns. This is most pronounced in single-nucleus RNA-seq datasets, as they only capture the nuclear component of each cell’s expression profile. Thus, aggregation of individual single-nucleus expression profiles is not expected to match bulk RNA-seq profiles that also capture the cytoplasmic component of each gene’s expression levels. Systematic corrections are therefore required to relate single-cell datasets to bulk datasets, but such corrections are currently not known.
In addition, existing deconvolution methods typically use a single reference profile for each cell type.10,11 Such profiles are sometimes obtained by averaging multiple cells of the specific cell type,12–14 and other times by using a predicted developmental trajectory.15 However, biological variables such as disease status, age, or biological sex can substantially influence the expression of marker genes in both single-cell RNA-seq and bulk RNA-seq samples, making it inappropriate to use the same cell-type-specific reference expression profile for each individual. For example, the expression of several neuronal markers changes with age, sex, and disease status. Similarly, markers for nearly all cell types vary in their expression based on the phenotypic status of each individual. New methods are therefore needed which can tailor the cell-type-specific reference expression profiles used for each individual to their biological covariates.
To address these challenges, we developed a new deconvolution method, SPLITR (for Single-cell Phenotype-aware deconvoLution across Individuals from Total RNA-seq), which explicitly models: (1) inter-individual variation in both bulk and cell-type-specific gene expression levels across biological covariates including age, biological sex, and disease status; and (2) platform-specific biases and differences between single-cell and bulk RNA-seq datasets, including differences in subcellular localization of each gene’s mRNA population in the nucleus/cytoplasm. We achieve this by executing the following four steps of model estimation.
In step 1 of the SPLITR method (Fig. 1a), we use reference single-cell datasets to establish marker genes and reference average expression levels for each target cell type. Here, we focus on brain cell types and use snRNA-seq profiles that our group previously generated across 80,000 cells from 48 individuals, clustered into seven cell types, consisting of excitatory neurons, inhibitory neurons, astrocytes, oligodendrocytes, oligodendrocyte progenitor cells (OPCs), microglia, and pericytes & endothelial cells.6We used these clusters to define a set of 117 marker genes that were the most characteristic of each cell type, based on their differential cell-type-specific gene expression patterns (Supplementary Fig. 1), and confirmed these marker genes agreed with cell-sorted expression profiles16and independent single-cell expression profiles.
In step 2 (Fig. 1b), we study the impacts of three phenotypic/biological covariates on cell-type-specific expression profiles. We use the cell-type-specific expression matrix (pseudo-bulk), created by taking average values over 80k cells across the 48 individuals within the seven cell types. We learn effects of the covariates by estimating a negative binomial model regressing cell-type-specific gene-vectors on the covariate and intercept terms. This modeling enables us to establish phenotype-aware adjustments to cell-type-specific expression patterns according to the phenotypic covariates of each individual whose bulk RNA-seq expression levels are to be deconvolved. For the application to Alzheimer’s Disease, we used age, sex, and pathological AD status, given their correlation with global gene expression changes in each major brain cell type.
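The covariate adjustment in step 2 can be sketched as follows. This is a minimal illustration, assuming a log link (the usual choice for negative binomial regression); the function name `adjusted_reference` and every coefficient value are hypothetical and are not fitted values from this study.

```python
import math

def adjusted_reference(intercept, coefs, covariates):
    """Expected expression of one marker gene in one cell type for one individual.

    Assumes a log-linear mean, mu = exp(intercept + sum_j coef_j * covariate_j),
    which is the standard link for negative binomial regression.
    """
    linear = intercept + sum(coefs[name] * covariates[name] for name in coefs)
    return math.exp(linear)

# Hypothetical coefficients for a single gene in a single cell type.
coefs = {"age": -0.01, "male": 0.2, "ad": -0.3}

control = adjusted_reference(2.0, coefs, {"age": 0.0, "male": 0, "ad": 0})
case = adjusted_reference(2.0, coefs, {"age": 0.0, "male": 0, "ad": 1})

# With a log link, a covariate multiplies the reference level by exp(coef).
fold_change = case / control
```

Under this assumed link, each covariate rescales the cell-type-specific reference multiplicatively, so the same marker gene can have a different expected level in each individual.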
In step 3 (Fig. 1c), we build a model that adjusts each marker gene’s bulk expression levels for experimental platform differences between scRNA-seq and bulk RNA-seq, and learn gene-specific correction terms. We find that scRNA-seq, and single-nucleus RNA-seq in particular, shows gene-specific systematic differences stemming from each gene’s mRNA localization patterns, gene length, transcriptional burstiness, mRNA stability, cell-to-cell variability, and nuclear vs. cytoplasmic fractions. We assume the overall impact of such biases is shared across individuals (samples), and estimate the corresponding terms by leveraging control genes that show consistent expression patterns between the bulk and snRNA-seq data (see Online Methods for details).
In step 4 (Fig. 1d), we use these learned parameters to deconvolve the bulk expression profile by fitting a negative binomial regression of the bulk profile, adjusting for the learned phenotype-specific correction terms for each individual, and the platform-specific correction terms for each gene. For each cell type, we calculate an activity fraction estimate, corresponding to the number of transcripts produced by that cell type, and a cell proportion estimate, correcting for the overall activity level of each cell type (Supplementary Fig. 3). In the case of brain, for example, we found that neuronal cells generate nearly four times more mRNA transcripts than glial cells, so, relative to glial cells, their activity fraction estimates are approximately four times larger than their cell proportion estimates.
Deconvolution of 3,387 bulk samples and experimental validation of cell type proportions
We used SPLITR to deconvolve a total of 3,387 bulk RNA-seq post-mortem brain samples, encompassing 15 brain regions from 1,127 individuals across three different studies (Fig. 2): (1) 482 dorsolateral prefrontal cortex (DLPFC) samples from the Religious Orders Study and Memory and Aging Project (ROSMAP) from Rush University3,17; (2) 263 temporal cortex samples from the Mayo Clinic Brain Bank18, and (3) 2,642 samples across 13 brain regions in the Genotype-Tissue Expression (GTEx) data9.
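The relationship between activity fractions and cell proportions can be illustrated with a short sketch. It assumes that the correction amounts to dividing each activity fraction by the relative mRNA output of one cell of that type and renormalizing; the function name and the two-cell-type example values are hypothetical.

```python
def activity_to_proportion(activity_fraction, transcripts_per_cell):
    """Convert activity fractions into cell proportions.

    Each activity fraction is divided by the relative mRNA output of a single
    cell of that type, and the results are renormalized to sum to one.
    """
    weights = {k: activity_fraction[k] / transcripts_per_cell[k]
               for k in activity_fraction}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Hypothetical two-type example: neurons produce 4x the mRNA of glia.
activity = {"neuron": 0.5, "glia": 0.5}
output_per_cell = {"neuron": 4.0, "glia": 1.0}
proportion = activity_to_proportion(activity, output_per_cell)
```

In this toy case the neuron-to-glia ratio is 1:1 in transcript output but 1:4 in cell counts, which is the fourfold shift described above.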
Despite common discrepancies in the literature between different methods of estimating cellular proportions in brain samples19, we found that the resulting activity-corrected cell type fraction estimates were consistent with several previously-reported fractions from direct measurements. For example, the average cell count fractions estimated from the 482 dorsolateral prefrontal cortex samples in the ROSMAP cohort were 31% for neurons, 51% for glial cells, and 18% for vascular cells, consistent with previous estimates using the fractionator sampling method20and the isotropic fractionator21. Excitatory neurons were 4 times as abundant as inhibitory neurons, in line with previous reports22.
We also experimentally confirmed these cell type fractions using immunostaining in matched samples (Fig. 2c). Our SPLITR estimate for astrocytes was 18% of cells (±5%), consistent with our immunostaining-measured average of 18% across 8 ROSMAP samples. Even for microglia, which are both less abundant and smaller cells, thus biasing their abundance in some single-cell preparations, our SPLITR-based estimate of 9% (±5%) was consistent with our immunostaining-measured average of 10%.
The cell type proportion estimates sometimes differed from the counts of nuclei obtained for each cell type from the 10X protocol6, with the single-nucleus counts sometimes closer to our activity fraction estimates. This is likely due to experimental biases in earlier versions of the 10X protocol to more efficiently capture larger nuclei with more transcripts. For example, astrocytes, microglia, pericytes, and endothelial cells were under-represented in our 10X datasets, while excitatory neurons and oligodendrocytes were overrepresented compared to previous estimates. This indicates that SPLITR deconvolution can provide accurate estimates of cell type proportions, even when based on single-cell datasets with skewed proportions, as it is based on the expression patterns inferred from the single-cell datasets, rather than the cell proportions in those datasets.
Our estimated cell type proportions also captured known variability across different brain regions. For example, we found a substantially higher fraction of oligodendrocytes in hippocampus than frontal cortex (48% vs. 16% on average), more microglial cells in hippocampus than cortex (12% vs. 6% on average), and fewer neurons in hippocampus than in temporal cortex or in frontal cortex (15% vs. 21% vs. 42% on average).
Our estimates also captured variation in cell type proportions across cohorts, associated with the different age ranges of the individuals profiled (Supplementary Tab. 1). Comparing the younger GTEx cohort (59 years old on average) with the older ROSMAP cohort (88 years old on average), we found that neurons decreased from 42% to 31% of cells, while glia increased from 38% to 50%, consistent with neuronal loss during aging.
We also compared the results of SPLITR with CIBERSORT10, using the same marker gene profiles (Supplementary Fig. 2). We found a general agreement for higher-abundance cell types, including excitatory neurons, oligodendrocytes, astrocytes, and microglia. However, for two of the cell types, CIBERSORT showed systematic problems, resulting in 0% estimates for inhibitory neurons for 68% of samples, and 0% estimate for oligodendrocyte progenitor cells for 79% of samples. Moreover, while SPLITR captured differences due to sex and age, CIBERSORT did not capture these subtle differences (Supplementary Fig. 2b-c).
Discovery of genetic variants influencing cell type proportion
We first sought to recognize genetic variants that may underlie these cell type proportion differences between individuals. Treating the SPLITR-inferred proportions of each of the 7 cell types as a quantitative trait, we carried out a cell-fraction genome-wide association study (cfGWAS) to recognize genome-wide significant and sub-threshold loci associated with cell type fraction (cfGWAS hits), which we define as single-nucleotide polymorphisms (SNPs) that govern cell type proportions. Using 403 ROSMAP individuals that have both genotype and RNA-seq data, we found several genome-wide significant (P<5e-8) and sub-threshold (P<1e-5) associations with cell type proportions.
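A single-SNP test of this kind can be sketched as an ordinary least-squares regression of the cell type fraction on allele dosage. This is only an illustration of the idea of treating a proportion as a quantitative trait: the data are made up, covariates are omitted, and the function name is hypothetical.

```python
import math

def dosage_association(dosage, trait):
    """Ordinary least squares of a quantitative trait on allele dosage (0/1/2).

    Returns the slope and its t-statistic (n - 2 degrees of freedom).
    Covariates and the p-value computation are omitted in this sketch.
    """
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, trait))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosage, trait))
    stderr = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / stderr

# Made-up data: the cell type fraction falls with each copy of the tested allele.
dosage = [0, 0, 1, 1, 2, 2]
fraction = [0.30, 0.32, 0.27, 0.29, 0.24, 0.26]
slope, t_stat = dosage_association(dosage, fraction)
```

Repeating such a test for every SNP and each of the seven cell types, and applying the thresholds given above, yields the genome-wide and sub-threshold hit lists.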
The strongest association (P=6.4e-09) was between reduced excitatory neuron fraction and >100 SNPs in the TMEM106B locus of chromosomal segment 7p21.3, including the A-to-G rs1990620 SNP (Fig. 3a). This locus has not previously been associated with AD, but it is associated with an AD-related neurodegenerative disease, frontotemporal lobar degeneration with TDP-43 inclusions (FTLD-TDP), and the decreased-neuronal-fraction allele shows increased FTLD-TDP risk and decreased cognition in amyotrophic lateral sclerosis (ALS)23–26. Indeed, this association was not due to AD pathology in our cohort, and the cell-type-proportion association remained statistically significant after conditioning on the pathological phenotypic variables, such as the accumulation of amyloid-beta and neurofibrillary tau (NFT) proteins, and pathological AD. These results suggest that the TMEM106B locus is an AD-independent contributor to cognitive decline, via decreased neuronal fraction.
Genotype-associated expression variation indicates that two nearby genes may mediate the TMEM106B locus genetic effect on neuronal fraction: the TMEM106B gene itself, a transmembrane gene involved in dendrite morphogenesis and the regulation of lysosomal trafficking, and GRN, an age-associated27 essential gene involved in tau-negative FTLD28 and lysosomal dysfunction during FTLD progression.29 Both genes showed significantly-reduced tissue-level expression associated with the decreased-neuronal-fraction allele (p<3e-06 and p<2e-03, respectively), and previous studies indicate that TMEM106B interacts with GRN, and that rs1990620 may be the causal variant in this locus, via disruption of a CTCF binding motif that alters a topologically-associated domain and up-regulates TMEM106B30. The associations were still significant when only including the controls (p<9e-04 and p<1.5e-02, respectively), and cell-type-specific eQTL analysis with cell-sorted and snRNA-seq data confirmed over-expression of TMEM106B in neurons, astrocytes, and oligodendrocytes, consistent with the previous reports31,32.
Several additional subthreshold-level associations with excitatory neuron fraction were found overlapping known causal genes in neurodevelopmental processes, including: ERBB4, a risk gene for schizophrenia and a selective and functional marker gene for glutamatergic33 and GABAergic synapses34 in inhibitory neurons and interneurons; DAOA, associated with schizophrenia in an Asian cohort35; and NPAS2, conferring neuropsychiatric anxiety disorder and regulating GABAergic signal transmissions36. While the main signal in the TMEM106B locus affects excitatory neurons, we found that rs1990620 is also associated with inhibitory neuron fraction (P=3.08e-6). Lastly, we found an additional association with inhibitory neurons within the TMEM106B locus, at cfGWAS SNP rs4721064 (P=1.69e-06), whose effect is independent of, and additive with, that of rs1990620.
Deconvolved cell-type fractions are associated with increased risk for diverse phenotypes
We found that changes in cell type fraction were also associated with increased risk for multiple traits, even when these were not directly measured in our cohort (Fig.3c-d). For all 944 genotyped individuals across ROSMAP and GTEx, we used their complete genotype information across millions of common variants to calculate their genetic risk for a set of 56 traits (Supplementary Tab. 2), using polygenic risk score (PRS) estimates from GWAS summary statistics data (p-value threshold 0.01, with linkage disequilibrium decorrelation37,38instead of pruning). A total of 17 traits showed nominal significance (p-value<0.05), including Alzheimer’s Disease and Crohn’s disease.
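The core of a thresholded polygenic risk score can be sketched as a weighted sum of allele dosages. The example below uses the p-value threshold stated above but deliberately leaves out the linkage disequilibrium decorrelation step; the SNP identifiers, effect sizes, and function name are hypothetical.

```python
def polygenic_risk_score(dosages, effects, pvalues, p_threshold=0.01):
    """Weighted sum of allele dosages over SNPs passing a p-value threshold.

    dosages: {snp_id: dosage in [0, 2]} for one individual
    effects: {snp_id: GWAS effect size of the counted allele}
    pvalues: {snp_id: GWAS p-value}
    LD decorrelation, which the text applies, is NOT implemented here.
    """
    return sum(effects[snp] * dosages[snp]
               for snp in effects
               if pvalues[snp] < p_threshold and snp in dosages)

# Hypothetical SNP identifiers and summary statistics.
effects = {"snpA": 0.10, "snpB": -0.05, "snpC": 0.30}
pvalues = {"snpA": 0.001, "snpB": 0.004, "snpC": 0.2}
person = {"snpA": 2, "snpB": 1, "snpC": 2}
score = polygenic_risk_score(person, effects, pvalues)
```

Here `snpC` is excluded by the threshold, so only the first two SNPs contribute to the score. The per-trait scores can then be tested for association with each deconvolved cell type fraction.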
We found several noteworthy examples (Fig. 3c-d): higher microglial fraction was associated with increased AD risk and increased body fat percentage, but decreased risk for type 2 diabetes (T2D) in ROSMAP; higher oligodendrocyte progenitor cell fraction was associated with increased risk of depression; higher pericyte and endothelial fraction was associated with increased risk of post-traumatic stress disorder (PTSD) and decreased risk of smoking; higher astrocyte fraction was associated with increased risk of depression and decreased risk of drinking; higher inhibitory neuron fraction was associated with higher cognitive performance and lower schizophrenia risk; lastly, an increased oligodendrocyte fraction was associated with decreased risk of Inflammatory Bowel Disease (IBD).
Many traits associated with the same cell type showed only negligible correlation at the overall PRS level, indicating that our method can capture correlations not directly visible using only genetic information.
Cell-type fraction differences associated with Alzheimer’s pathology, biological sex, and age
We next investigated whether cell type proportion changes were associated with phenotypic differences between individuals within the ROSMAP cohort, where phenotypic variables are readily available (Fig. 4).
We found that AD-related pathological variables were strongly associated with cell type proportion differences (Fig. 4c; Supplementary Fig. 4). Amyloid-beta deposition showed the strongest associations with fewer excitatory neurons (P<8e-5), fewer inhibitory neurons (P<3e-3), more oligodendrocytes (P<2e-6), more astrocytes (P<3.8e-3), and more pericytes/endothelial cells (P<1e-4). Tau-protein deposition and loss of cognition also showed significant associations with fewer excitatory neurons and more oligodendrocytes.
We also found that cell type proportions were strongly associated with both biological sex and age (Fig. 4a-b). Male samples showed a higher fraction of excitatory neurons than female samples (P<0.004, Wilcoxon rank-sum test) and a lower fraction of astrocytes (P<0.03), oligodendrocyte progenitor cells (P<0.006), and vascular cells (P<2e-5) (Fig. 4a). In addition, older individuals (>100 years old) showed different cell type proportions than younger groups (<90 years old), with fewer excitatory neurons (P<0.008), more astrocytes (P<0.003), and fewer microglia (P<0.002) (Fig. 4b).
These results indicate that our deconvolved cell type fractions successfully capture cell type proportion changes associated with phenotypic differences, even though only bulk samples were utilized for these analyses.
Cell-type-specific gene expression changes in AD show biologically-meaningful functional enrichments
We found 2,470 genes with cell-type-specific changes associated with amyloid-beta, neurofibrillary tangles, and episodic memory decline in one of the seven cell types (Fig. 5a-b), using a generative model that captures the relationships between each gene’s transcript level with an interaction term of cell type and each pathological variable (age, sex, RIN scores, and other phenotypes). We controlled the FDR at 4.4% with the null distribution constructed by the Freedman-Lane permutation39of only one interaction term at a time while fixing all other correlated variables (Fig. 5a, Supplementary Fig. 5a). Only 12 of these 2,470 genes are among the 171 cell-type-marker genes.
These 2,470 genes showed highly cell-type-specific enrichment across 191 gene ontology (GO) terms (Fig.5c) and 88 MSigDB40canonical pathways (Fig.5d) (FDR < 5%). Distinct enrichments were sometimes found for distinct AD phenotypes, between memory loss, neurofibrillary tangles, and amyloid-beta.
For example, genes with inhibitory-neuron-specific expression differences associated with AD pathology were enriched in intracellular transport (including endoplasmic reticulum) for memory-associated expression changes, and with mitochondrial biology for amyloid-associated changes. Genes with oligodendrocyte-specific expression differences associated with AD pathology were enriched in notochord development for memory-associated changes consistent with their roles in remyelination41,42and with our single-cell analysis results6, and with mesenchymal differentiation for tangles-associated changes. Genes with microglia-specific expression differences in AD were enriched in synaptic plasticity43for neurofibrillary-tangles-associated expression changes, in mitochondrial functions for memory-associated expression changes, and fatty acid metabolism for amyloid-beta-associated expression changes. Genes with astrocyte-specific expression differences in AD were enriched in cytokines and secretion for amyloid-beta-related phenotypes, consistent with secretion of pro-inflammatory cytokines in astrocytes with the accumulation of amyloid-beta44.
These results reveal a complex set of cell-type-specific alterations in diverse pathways associated with distinct phenotypic signatures of AD, provide important insights into the cellular and molecular changes in AD, and demonstrate SPLITR’s ability to recognize cell-type-specific changes from bulk RNA expression.
Sparse Bayesian regression deconvolves tissue-level genetic effects into cell-type-specific eQTLs
To help elucidate causal paths between genetic variation and complex brain disorders, we next sought to recognize genetic variants with cell-type-specific effects on brain gene expression, both at the bulk level and at the cell-type-specific level. For tissue-level eQTLs, we used our previously-described sparse Bayesian multivariate model45, and for cell-type-level eQTLs we developed a new Bayesian eQTL deconvolution framework that models the observed bulk genetic effects as a mixture of cell-type-specific genetic effects, and infers a cell-type-specificity score (between 0 and 1) for each eQTL gene (eGene) in each cell type, corresponding to the probability with which this gene has cell-type-specific genetic effects for that cell type (Fig. 6a; Methods). To compare the performance of our deconvolved multivariate approach, termed deQTL, with other interaction QTL methods, we simulated realistic gene expression data, embedding a single causal cell type for each gene. We repeated our experiments on 121 randomly-selected linkage disequilibrium (LD) blocks, varying the level of expression heritability and number of causal eQTL variants (see Methods). In power analysis, our proposed approach clearly outperforms the other methods frequently used in cell-type interaction QTL analysis (Fig. 6b). Moreover, under the high heritability regime (> 10%), the posterior probability of the deQTL model accurately distinguishes causal cell types from the non-causal ones (Fig. 6c).
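The intuition behind modeling a bulk genetic effect as a mixture can be shown with a toy calculation. This sketch assumes the simplest linear form, in which the bulk per-allele effect for a sample is the sum of cell-type-specific effects weighted by that sample's cell type fractions; it is not the Bayesian deQTL model itself, and all names and values are hypothetical.

```python
def bulk_genetic_effect(fractions, cell_type_effects):
    """Bulk per-allele effect as a fraction-weighted sum of cell-type effects."""
    return sum(fractions[k] * cell_type_effects[k] for k in cell_type_effects)

# Hypothetical eQTL that acts only in excitatory neurons.
effects = {"excitatory": 1.0, "astrocyte": 0.0}
low_neuron = bulk_genetic_effect({"excitatory": 0.1, "astrocyte": 0.9}, effects)
high_neuron = bulk_genetic_effect({"excitatory": 0.5, "astrocyte": 0.5}, effects)
```

Under this assumption the apparent bulk effect grows with the fraction of the causal cell type, which is why an interaction between genotype and estimated cell fraction carries information about cell-type specificity.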
We applied this method to the 403 ROSMAP individuals that have both genotype information and gene expression information available. At the tissue level, we found a total of 5,586 eQTL genes (eGenes) with highly-heritable gene expression, associated with a total of 7,783 independent SNPs. At the cell-type-level, we found a total of 3,869 eGenes with cell-type-specificity score >0.9, associated with 4,757 independent SNPs. Approximately half of tissue-level eGenes (N=2,687, 48%) were also discovered at the cell-type level (Supplementary Fig. 5), enabling us to partition their genetic effects into the cell-types where they act.
A large fraction of cell-type-specific eGenes (N=1,182, 30%) were not discovered in our tissue-level analysis, indicating that our approach can discover high-confidence cell-type-specific eGenes even when these are not visible at the tissue level (Supplementary Tab. 3). For example, DRD5 showed no genetic association at the tissue level, but individuals with the TT genotype of rs6448858 (chr4:9595918) were in the top 40% of samples with highest excitatory neuron content, resulting in a high interaction term in our model, and a high cell-type-specificity score (Fig. 6d). Similarly, ICA1 showed no tissue-level genetic effect, but individuals carrying the CC genotype of rs6965329 (chr7:8161981) were among the 20% of samples with highest inhibitory neuron fractions (Fig. 6e).
Most cell-type-specific eGenes act in a single cell type (N=3,133, 81%), and a minority act in multiple cell types (N=736, 19%). Most act in inhibitory and excitatory neurons (61%), followed by oligodendrocytes (n=710), astrocytes (n=588), microglia (n=364), pericytes & endothelial cells (n=319), and oligodendrocyte progenitor cells (n=24) (Supplementary Tab. 1). For 872 cell-type-specific eGenes we found multiple independent eQTL variants, indicating more complex genetic control. Conversely, for 267 cell-type-specific eQTLs, we found multiple target eGenes, implicating gene-level pleiotropy.
Stratification of the GWAS polygenic risk score (PRS) models by the deconvolved eQTL annotations
Lastly, we sought to recognize the cell types where disease-associated genetic variants exert their effect for diverse brain disorders, using genome-wide statistics for 56 neuronal, behavioral, psychiatric, and neurodegenerative traits (Supplementary Tab. 1). For each of the seven major cell types, we computed a PRS for each of the traits, using all nominally-significant (P-value<1e-2) SNPs that lie within a ±1 kb window of an annotated cell-type-specific eQTL for that cell type (Fig. 7a). We then calculated the enrichment for each cell type by comparing the cell-type-specific PRS score to the PRS score obtained using all the SNPs.
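The SNP selection behind a cell-type-stratified PRS can be sketched as a window-and-threshold filter. The window size and p-value threshold follow the values stated above; the coordinates, SNP identifiers, and function name are hypothetical.

```python
def snps_near_eqtls(gwas_snps, eqtl_positions, window=1000, p_threshold=1e-2):
    """Select GWAS SNPs within +/- window bp of a cell-type-specific eQTL.

    gwas_snps: list of (snp_id, chromosome, position, p_value)
    eqtl_positions: {chromosome: [positions of eQTLs for one cell type]}
    """
    selected = []
    for snp_id, chrom, pos, pval in gwas_snps:
        if pval >= p_threshold:
            continue
        if any(abs(pos - q) <= window for q in eqtl_positions.get(chrom, [])):
            selected.append(snp_id)
    return selected

# Hypothetical coordinates and p-values.
gwas = [
    ("snp1", "chr1", 10_500, 0.001),   # 500 bp from an eQTL, significant
    ("snp2", "chr1", 20_000, 0.001),   # significant but far from any eQTL
    ("snp3", "chr2", 5_100, 0.5),      # near an eQTL but not significant
]
microglia_eqtls = {"chr1": [10_000], "chr2": [5_000]}
kept = snps_near_eqtls(gwas, microglia_eqtls)
```

A PRS restricted to the retained SNPs can then be compared with the PRS built from all SNPs to assess enrichment for that cell type.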
Across all 1,682 individuals in the ROSMAP cohort46, we found 15 GWAS traits that show significant cell-type-specific PRS scores across 19 cell types (FDR<10%), indicating that genetic variants in that trait preferentially act through cell-type-specific eQTLs in that cell type (Fig. 7b-c). For example, we find that microglial-specific eQTLs contribute disproportionately to the risk scores of AD2, OCD (obsessive-compulsive disorder)47,48and ASD (autism spectrum disorder)49. Similarly, oligodendrocyte-specific eQTLs significantly enrich GWAS signals of osteoarthritis50and cognitive performance51. Pericyte and endothelial-specific eQTLs contribute disproportionately to increased risk of smoking52, UC (ulcerative colitis)53, allergy54, and depression and bipolar disorders54.
The importance of microglial cells in AD is well-recognized55,56, and several AD genes, such as BIN157and MS4A458, are shown to act specifically in microglial cells. For ASD, the previous analysis showed male-specific over-expression of microglial marker genes in the cortex59; for OCD, a mouse study showed that over-expression of NFKB/TNF-alpha pathways causally acts on relevant traits, such as excessive self-grooming behavior and hyperexcitability of the corresponding neurons60.
Discussion
Understanding the mechanism of complex traits, including neurodegenerative disorders, has become a crucial component of prevention and treatment, yet remains a challenging and open problem. Part of the challenge stems from the complexity of the diseases at the cellular and molecular levels. A causal mechanism of complex traits is often manifested through multiple layers of genomic and epigenomic regulatory networks. The emerging technology of single-cell and single-nucleus sequencing provides unbiased profiling of cell types from a mixture of samples. Knowing the relevant cell-type context is a crucial step toward dissecting the complexity of diseases. Cell-type information enables biologically-informed Bayesian and causal inference, and improves experimental design in a matched cellular environment.
However, most single-cell-resolution profiling experiments cover a limited sample size and do not include the investigation of variation across individuals. On the other hand, while tissue-level bulk RNA-seq data fail to reach a cell-type resolution, they often carry a sufficiently large sample size. From richly-phenotyped bulk data, we can identify population-level associations of transcript measurements with other variables, such as genetic variants and phenotypes. Associations with small-effect variables are only made possible with a large cohort. Computational deconvolution methods, including SPLITR, bridge the gap between snRNA-seq and bulk RNA-seq data. We learn cell-type models from snRNA-seq and estimate cell-type fractions in the bulk data so that subsequent analysis can leverage a large sample size and rich phenotypic information.
Here, we present a highly calibrated deconvolution method, SPLITR, followed by a series of integrative studies with the variables in large-sample bulk data. We identified cell-type-specific mechanisms of AD and other relevant disorders at the phenotype, demographic, pathway, and genotype levels. Moreover, we characterized putative mechanisms, which may have impacted AD and other diseases, while pinpointing a molecular and pathway-level basis for understanding the comorbidity of complex neurodegenerative disorders. For instance, our results already suggest that microglial cells are a converging point of AD and neuropsychiatric disorders, such as OCD and ASD. Genetic markers in TMEM106B implicate potential pleiotropy between AD and FTLD in neuronal cells. Applying the same principle, we can investigate other neurodegenerative and neuropsychiatric disorders and even diseases in other domains, such as diabetes and cardiovascular disorders.
Materials and Methods
Preprocessing of the ROSMAP, Mayo, and GTEx RNA-seq data
We downloaded the ROSMAP RNA-seq data in the Dorsolateral Prefrontal Cortex (DLPFC) from Synapse (https://www.synapse.org/#!Synapse:syn3388564). We used gene-level expression data quantified by RSEM61, including a total of 55,889 coding and non-coding genes according to the GENCODE annotations (v19). The RNA-seq raw count data in the temporal cortex of 263 individuals from the Mayo RNA-seq project was downloaded from Synapse (https://www.synapse.org/#!Synapse:syn3163039). From the GTEx project (v8), we obtained gene-level count data in 13 brain regions, which will be made publicly available. We removed low-expressed genes (those genes for which fewer than three individuals had counts-per-million > 1) before normalization. We then normalized the RNA-seq raw counts using the trimmed mean of M-values normalization method62.
Definitions of an individual-specific deconvolution model
The ultimate goal of the deconvolution is to estimate the cell type fraction πik of each cell type k in an individual i, treating the selected marker genes as data points. In each bulk sample i, we fit the NB model by regressing the bulk profile vector yi on the estimated cell type profile matrix learned from snRNA-seq data.
We assume each gene-level quantification, Y (or Ygi for a gene g on sample i), follows a Negative Binomial (NB) distribution.63 Namely, we define the data likelihood of Y with the mean μ and over-dispersion ϕ parameters, and we define the NB model for the deconvolution problem on top of it. Besides the π parameter, this model introduces the following auxiliary parameters:
si: sample-specific bias term for each individual i (easily estimable)
ϕg: over-dispersion parameter for each gene g (easily estimable)
δg: gene-specific bias term for each gene g in the bulk data
As for the first two parameters, we simply replace the sample-specific bias si with the sequencing depth of the bulk sample and find a suitable gene-level dispersion parameter ϕg using an empirical Bayes method implemented in edgeR63. However, finding a suitable δ value is non-trivial, as it can be tightly coupled with π and is shared across all the samples. We discuss posterior inference algorithms in the next section.
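As a minimal illustration, the NB likelihood in a mean and over-dispersion parameterization can be sketched in Python as follows. The variance form μ + ϕμ², the function names, and the form of the bulk mean (sequencing depth times exp(δg) times the π-weighted reference profile) are assumptions made for this sketch; they are not taken from the model definition above.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log NB probability of count y with mean mu and over-dispersion phi.

    Assumed parameterization: variance = mu + phi * mu**2.
    """
    r = 1.0 / phi  # NB "size" parameter
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))

def bulk_mean(s_i, delta_g, x_g, pi_i):
    """Hypothetical bulk mean for gene g in sample i.

    s_i     : sequencing depth of the bulk sample
    delta_g : gene-specific bias (log scale, an assumption)
    x_g     : reference expression of gene g in each cell type
    pi_i    : cell type fractions of sample i
    """
    mixture = sum(x * p for x, p in zip(x_g, pi_i))
    return s_i * math.exp(delta_g) * mixture

mu = bulk_mean(2.0, 0.0, [1.0, 3.0], [0.5, 0.5])
loglik = nb_logpmf(3, mu, 0.5)
```

In this form, deconvolution amounts to finding the fractions pi_i that maximize the summed log-likelihood over the marker genes, with s_i and ϕg held fixed.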
Reference cell-type models with the sample-specific covariates adjusted (steps 1 and 2)
From the snRNA-seq profiling followed by clustering analysis6, we construct a cell type-specific marker gene expression matrix Rgik (of gene g, sample i, cell type k), including the 171 marker genes (∼25 most differentially expressed in each cell type). Unlike conventional deconvolution methods10,11 that directly use these marker gene profiles to estimate cell type fractions of bulk RNA-seq data, we adapt the cell type-specific marker gene model to the heterogeneity of biological and technical covariates.
We train each cell type k’s model by fitting the following NB regression model, NB(Rgik | li μgik, ψg), across i = 1, …, 48 individuals, where li is the library size of sample i. We further specify the mean function μgik as ln μgik = βgk0 + ∑c Cic βgkc (c = 1, 2, 3), with the baseline activity βgk0 and the observed covariates Ci1, Ci2, and Ci3 corresponding to the age, sex, and AD status of an individual i, respectively.
We first estimate the overdispersion parameter ψg using DESeq264. We then estimate the NB regression parameters using Stan65 and construct the adjusted reference panel for a new sample i by plugging in the trained model parameters. If all the coefficients (β) except the baseline were set to zero, our reference panel would be identical to the marker gene profiles used in the existing methods. By including any non-zero effects of the known covariates, we prevent the marker genes from being influenced by these covariates in the subsequent deconvolution steps.
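The plug-in step can be sketched as below, assuming a log-linear mean function in which the covariate effects add to the baseline on the log scale. The function name and the numerical values are hypothetical and serve only to show that zero covariate coefficients recover the unadjusted marker profile.

```python
import math

def adjusted_reference(beta0, beta_cov, covariates):
    """Expected marker expression for one gene and cell type in a new sample.

    beta0      : baseline log-activity (hypothetical fitted value)
    beta_cov   : fitted coefficients for age, sex, and AD status
    covariates : the new sample's covariate values, in the same order
    """
    log_mu = beta0 + sum(b * c for b, c in zip(beta_cov, covariates))
    return math.exp(log_mu)

# With all covariate coefficients at zero, only the baseline remains.
baseline_only = adjusted_reference(1.2, [0.0, 0.0, 0.0], [0.5, 1.0, 1.0])
# Non-zero coefficients shift the reference for this particular sample.
adjusted = adjusted_reference(1.2, [0.1, -0.3, 0.4], [0.5, 1.0, 1.0])
```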
Learning gene-specific bias between the bulk and snRNA-seq (step 3)
In our preliminary experiments, a brute-force parametric estimation method that directly estimates the posterior distribution of the bias and the cell-type fraction parameters often yielded poor results, e.g., high variance. Instead, we estimate δg under the assumption that the individual-level cell type fractions πik can be summarized by their average across individuals. To estimate this average, we leverage the subset of 69 “control” genes (Supplementary Fig. 1b, 1d) whose relative expression levels are robustly stable between the bulk RNA-seq and snRNA-seq data, and less variable across individuals. We optimize both quantities in an EM algorithm by alternating between two models: one for δ, holding the average fractions fixed, and the other for the average fractions, holding the δ values fixed.
Deconvolution to estimate the individual-specific cell type composition (step 4)
Provided that we have estimated the auxiliary variables (δ, s, and ϕ) along with the parameters in the individual-specific reference cell type models (β), we resolve the individual-level cell type compositions (πik) by Bayesian inference using Stan65.
Additional calibration step to compute the cell-level fraction estimates of cell types
To convert the transcript-level cell type fraction estimates to the composition of actual cell counts, we need to adjust for differences in transcript abundance per cell across cell types. Using the average number of transcripts per cell within each cell type k in the snRNA-seq data, we reverse-engineer the cell-level fraction (π′) of each type that could have generated the estimated transcript-level fractions (π). We solve a constrained optimization for π′ for each individual i.
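To make the idea concrete, the sketch below assumes that the transcript-level fraction of a cell type is exactly proportional to its cell-level fraction times its average number of transcripts per cell. Under that simplifying assumption the conversion has a closed form, so no numerical optimization is needed; the function name and example numbers are hypothetical.

```python
def cell_level_fractions(pi, transcripts_per_cell):
    """Convert transcript-level fractions into cell-level fractions.

    Assumes pi[k] is proportional to pi_prime[k] * transcripts_per_cell[k],
    so pi_prime[k] is proportional to pi[k] / transcripts_per_cell[k].
    """
    weights = [p / n for p, n in zip(pi, transcripts_per_cell)]
    total = sum(weights)
    return [w / total for w in weights]  # non-negative and sums to one

# Two cell types with equal transcript share; the first carries twice
# as many transcripts per cell, so it accounts for fewer cells.
pi_prime = cell_level_fractions([0.5, 0.5], [2.0, 1.0])
```

In this example the first cell type ends up with one third of the cells and the second with two thirds.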
Pathway enrichment
We measure the impact of cell-type fractions on downstream transcript levels at the pathway level. Within each pathway, and for each cell type, we compute gene-level z-scores that estimate the significance of covariance between the cell-type fraction and the genes in the pathway. To construct a test statistic for a pathway with m genes on a cell type k, we first standardize the cell-type fraction scores pik (for individual i = 1, …, n, and cell type k) and gene expression xig (for an individual i and a gene g), and construct a gene-level score. Combining these scores, we obtain a test statistic across the m genes within each pathway. We estimate the null distribution by sample permutation along the individual axis.
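A permutation test of this kind can be sketched as follows. The specific gene-level score (the scaled inner product of the standardized fraction and expression vectors) and the pathway statistic (the sum of squared gene-level scores) are assumptions chosen for illustration, as are all function names.

```python
import math
import random

def standardize(v):
    """Center to zero mean and scale to unit (population) variance."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / sd for x in v]

def pathway_stat(frac, expr_cols):
    """Sum of squared gene-level scores, z_g = sum_i p_i x_ig / sqrt(n)."""
    n = len(frac)
    return sum(
        (sum(f * x for f, x in zip(frac, col)) / math.sqrt(n)) ** 2
        for col in expr_cols
    )

def permutation_pvalue(frac, expr_cols, n_perm=1000, seed=0):
    """Permute individuals in the fraction vector to build the null."""
    rng = random.Random(seed)
    frac = standardize(frac)
    cols = [standardize(c) for c in expr_cols]
    observed = pathway_stat(frac, cols)
    perm = list(frac)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if pathway_stat(perm, cols) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy pathway of two genes that track the cell-type fraction perfectly.
fractions = [float(i) for i in range(20)]
genes = [[float(i) for i in range(20)], [2.0 * i + 1.0 for i in range(20)]]
stat, pval = permutation_pvalue(fractions, genes)
```

Permuting only the fraction vector keeps the gene-gene correlation within the pathway intact under the null.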
Genotype data imputation
We collected genotypes of 672,266 SNPs in 1,709 individuals from the Religious Orders Study (ROS) and the Memory and Aging Project (MAP)46 in the GWAS for detecting cfGWAS hits. We mapped hg18 coordinates of SNPs (Affymetrix GeneChip 6.0) to hg19 coordinates, matching strands using publicly available information (http://www.well.ox.ac.uk/~wrayner/strand/GenomeWideSNP_6.na32-b37.strand.zip). We retained only those SNPs with MAF > 0.05 and Hardy-Weinberg equilibrium (HWE) p-value > 1e-04, computed based on 432 individuals who had all phenotype, genotype, and gene expression data. We imputed the genotypes by pre-phasing haplotypes based on the 1000 Genomes Project66 (phase I version 3) using SHAPEIT67. We then imputed SNPs in 5 Mb windows using IMPUTE268 with 100 Markov Chain Monte Carlo iterations and 10 burn-in iterations and retained only SNPs with MAF > 0.05 and imputation quality score > 0.6. For the Mayo RNA-Seq project, we used a genotype dataset imputed by the Michigan Imputation Server69 with the Haplotype Reference Consortium (hrc.r1.1.2016) panel70. The following documents provide more details about the Mayo dataset: https://www.synapse.org/#!Synapse:syn8650955.
Polygenic risk scores
We modeled the polygenic risk ρi of an individual i as a weighted sum of scaled genotype information71: ρi = ∑j Gijθj, where the genotype information Gij (of individual i and SNP j) is weighted by the coefficients θj transferred from GWAS summary statistics, restricted to SNPs passing the p-value threshold (p < 0.01) and with the LD (linkage disequilibrium) structure decorrelated. Lacking individual-level phenotypes for all the available GWAS statistics, we fixed the p-value cutoff and replaced the LD pruning steps with the decorrelation steps.37,38 However, fine-tuning these parameters would be expected only to improve the performance.
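The score ρi = ∑j Gijθj with a p-value cutoff can be sketched in a few lines. This sketch assumes the genotypes are already scaled and the effect sizes are already decorrelated with respect to LD; the decorrelation step itself is omitted, and the function name and toy values are hypothetical.

```python
def polygenic_risk_scores(genotypes, effects, pvalues, p_cutoff=0.01):
    """Compute rho_i = sum_j G_ij * theta_j over SNPs with p < p_cutoff.

    genotypes : list of per-individual rows of scaled genotypes
    effects   : per-SNP coefficients (assumed already LD-decorrelated)
    pvalues   : per-SNP GWAS p-values
    """
    keep = [j for j, p in enumerate(pvalues) if p < p_cutoff]
    return [sum(row[j] * effects[j] for j in keep) for row in genotypes]

# Two individuals, three SNPs; the second SNP fails the cutoff.
scores = polygenic_risk_scores(
    genotypes=[[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]],
    effects=[0.5, -1.0, 0.25],
    pvalues=[0.001, 0.5, 0.005],
)
```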
Sparse Bayesian regression to deconvolve tissue-level genetic effects into cell-type-specific eQTLs
We designed the deconvolved eQTL (deQTL) model from the following Bayesian generative scheme:
(1) For each genetic variant j and cell type k, we sample multivariate eQTL effect sizes θjk from a spike-slab prior.72 (2) Each cell type k generates expression variation across individuals by a linear model ηik = ∑j Gijθjk on the genetic information Gij of each individual i at SNP j. (3) However, we only observe a bulk gene expression profile that is a mixture of the cell type-specific genetic effects ηik across the seven cell types, with some mixing proportion πk (Fig. 6a). Provided that the estimated cell-type composition π is unbiased, we can model the mean of the bulk profile as:
where we additionally include a probabilistic loading factor λk ∈ (0, 1).
If we estimate the deQTL model SNP by SNP and cell type by cell type (p = 1), this model simply reduces to an interaction eQTL model73, testing whether the coefficient θjk is non-zero in Yi ∼ θjkπik × Gij without the two singleton terms, π and G. In our multivariate model, we could include these non-interacting terms, but we found that such an over-parameterization was not as powerful as one might have expected. We concluded that these extra terms are unnecessary because they are likely to mediate the effect of cell-type-specific genetic variables by construction. It is widely accepted that effect size estimation of a causal path, while conditioning on an intermediate variable, can easily produce a biased result.74
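For the single-SNP, single-cell-type special case, the interaction-only regression Yi ∼ θjkπik × Gij can be sketched with ordinary least squares, as below. This is not the sparse Bayesian deQTL model; the absence of an intercept, the residual degrees of freedom, and the function name are assumptions of this illustration.

```python
import math

def interaction_only_fit(y, pi_k, g):
    """Least-squares fit of y_i = theta * (pi_ik * G_ij) + noise.

    Returns the estimated theta and its standard error. No intercept and
    no singleton terms for pi or G are included.
    """
    x = [p * gi for p, gi in zip(pi_k, g)]  # interaction covariate
    sxx = sum(v * v for v in x)
    theta = sum(a * b for a, b in zip(x, y)) / sxx
    resid = [yi - theta * xi for yi, xi in zip(y, x)]
    sigma2 = sum(r * r for r in resid) / (len(y) - 1)
    return theta, math.sqrt(sigma2 / sxx)

# Noise-free toy data generated with theta = 2.
pi_k = [0.2, 0.5, 0.3, 0.4]
g = [0.0, 1.0, 2.0, 1.0]
y = [2.0 * p * gi for p, gi in zip(pi_k, g)]
theta_hat, se = interaction_only_fit(y, pi_k, g)
```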
Simulation of bulk eQTL data using actual cell type composition and genotype matrix
We first select a causal cell type out of seven brain cell types where there are genetic effects on causal SNPs. In Fig. 6b, we only show the results on the data simulated with three causal SNPs, but we varied the number of causal SNPs from 1 to 3. Our simulator generates gene expression data using the actual genotype matrix (G, standardized) and the deconvolved cell type estimates (π). We evaluated statistical power under different levels of total expression heritability (h²), varying from 5% to 40%. Provided that there is one cell type (out of a total of K = 7) genetically regulated with three causal SNPs, our simulator generates convolved gene expression profiles in the following steps.
For each cell type k ∈ [K] (K = 7), if k is causal: we select three causal SNPs (j’s) uniformly at random and sample each genetic effect size θjk ∼ 𝒩(0, (K/3)²). For non-causal SNPs, we simply let the effect size θjk = 0. The deconvolved expression vector is constructed by a linear combination of the selected SNPs: yk ∼ Gθkπk.
For each remaining non-causal cell type l ≠ k, we assign the expression vector yl to non-genetic signals by sampling from an isotropic standard Gaussian distribution, and combine them by taking a weighted linear combination, y0 = ∑l∈non-causal ylπl, which excludes the genetically regulated cell type.
We rescale y0 by a scaling factor to achieve 𝕍[y0] = 𝕍[ηg](1/h² − 1), which ensures that the simulated heritability matches the assumed level, namely, h² = 𝕍[ηg] / (𝕍[y0] + 𝕍[ηg]).
The bulk RNA-seq data are then simply a linear combination of these simulated cell-type profiles.
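The steps above can be sketched in Python as follows. The sketch assumes that the bulk profile is the sum of the genetic component of the causal cell type and the rescaled non-genetic component; the function name, the toy inputs, and the use of population variances are choices made only for this illustration.

```python
import random
import statistics

def simulate_bulk(G, pi, causal_k, h2, n_causal=3, seed=0):
    """Simulate one bulk expression vector with heritability h2.

    G  : n x p standardized genotype matrix (list of rows)
    pi : n x K cell type fractions (list of rows)
    """
    rng = random.Random(seed)
    n, p, K = len(G), len(G[0]), len(pi[0])
    # Causal cell type: effects on randomly chosen SNPs, zero elsewhere.
    causal_snps = rng.sample(range(p), n_causal)
    theta = {j: rng.gauss(0.0, K / 3) for j in causal_snps}
    eta = [sum(G[i][j] * t for j, t in theta.items()) * pi[i][causal_k]
           for i in range(n)]
    # Non-causal cell types: standard Gaussian signals mixed by fraction.
    y0 = [sum(rng.gauss(0.0, 1.0) * pi[i][l] for l in range(K) if l != causal_k)
          for i in range(n)]
    # Rescale so that Var[y0] = Var[eta] * (1 / h2 - 1).
    target = statistics.pvariance(eta) * (1.0 / h2 - 1.0)
    scale = (target / statistics.pvariance(y0)) ** 0.5
    y0 = [scale * v for v in y0]
    bulk = [e + v for e, v in zip(eta, y0)]
    return bulk, eta, y0

# Toy inputs standing in for the actual genotype and fraction matrices.
toy_rng = random.Random(1)
n, p, K = 60, 8, 7
G = [[toy_rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
raw = [[toy_rng.random() + 0.1 for _ in range(K)] for _ in range(n)]
pi = [[v / sum(row) for v in row] for row in raw]
bulk, eta, y0 = simulate_bulk(G, pi, causal_k=2, h2=0.2)
```

After rescaling, the ratio of the genetic variance to the sum of the genetic and non-genetic variances equals the requested h2 up to floating-point error.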
Competing deconvolved eQTL methods
We include comparison with other commonly used interaction QTL methods (Fig. 6):
deQTL (this work): We fit the multivariate deQTL model with a stochastic variational Bayes inference algorithm. We then prioritize cell types in descending order of the maximal posterior inclusion probability of genetic effects across SNPs, for each cell type k.
deQTL (this work with additional terms): We prioritize cell types by the same procedure as above (deQTL) except that we added extra (and unnecessary) non-interaction terms of genotypes and cell types.
Interaction QTL: We estimate the full set of p-values for conventional interaction QTL analysis using lm(y ∼ cell type * genotype + cell type + genotype) in R. We then summarize each cell type’s score by the minimum p-value across SNPs within each cell type, and prioritize the cell types in ascending order of these minimal p-values.
Immunostaining validation of the predicted cell-type fractions
Fixed human brain tissue (prefrontal cortex, BA10) was sectioned at 50 µm using a vibratome (Leica). The sections were boiled in IHC Antigen Retrieval Solution (ThermoFisher Scientific; catalog number 00-4955-58) containing 0.05% Tween-20 for 10 minutes and then placed in PBS for 20 minutes at room temperature. After washing with ddH2O (three times 15 minutes) followed by one wash with PBS for 15 minutes, the brain sections were incubated in quenching solution (50 mM ammonium acetate, 100 mM CuSO4) at room temperature overnight. After washing with ddH2O (one wash for 15 minutes) and PBS (three times 15 minutes), the sections were permeabilized in PBS containing 0.3% Triton X-100 for 10 minutes and blocked in PBS containing 0.3% Triton X-100 and 5% normal donkey serum at room temperature for 2 hours. The sections were incubated for 2 hours at room temperature in primary antibody in PBS with 0.3% Triton X-100 and 5% normal donkey serum. Primary antibodies were an anti-GFAP antibody (1:100; Abcam; ab53554, Goat polyclonal) and an anti-Iba1 antibody (1:500; Synaptic Systems; Cat. No. 234 004, Polyclonal Guinea pig antiserum). The sections were washed with PBS containing 0.3% Triton X-100 and 5% normal donkey serum at room temperature (four times 15 minutes) and then incubated with secondary antibodies (dilution 1:2000) for 2 hours at room temperature. Primary antibodies were visualized with Alexa-Fluor 488 and Alexa-Fluor 594 antibodies (Molecular Probes), and cell nuclei were visualized with Hoechst 33342 (Sigma-Aldrich; 94403). The sections were washed with PBS containing 0.3% Triton X-100 and 5% normal donkey serum at room temperature (four times 15 minutes) and then mounted on Fisherbrand (TM) Superfrost (TM) Plus Microscope Slides in ProLong (TM) Gold Antifade Mountant. Images were acquired using a confocal microscope (LSM 710; Zeiss) with a 20x or 40x objective, and cell numbers were quantified using Imaris 8.3.1.