Abstract
Many non-coding genetic variants cause disease by modulating gene expression. However, identifying these expression quantitative trait loci (eQTLs) is complicated by gene-regulation differences between cell states. T cells, for example, have fluid, multifaceted functional states in vivo that cannot be modeled in eQTL studies that aggregate cells. Here, we modeled T cell states and eQTLs at single-cell resolution. Using >500,000 resting memory T cells from 259 Peruvians, we found over one-third of the 6,511 cis-eQTLs had state-dependent effects. By integrating single-cell RNA and surface protein measurements, we defined continuous cell states that explained more eQTL variation than discrete states like CD4+ or CD8+ T cells and could have opposing effects on independent eQTL variants in a locus. Autoimmune variants were enriched in cell-state-dependent eQTLs, such as a rheumatoid-arthritis variant near ORMDL3 strongest in cytotoxic CD8+ T cells. These results argue that fine-grained cell state context is crucial to understanding disease-associated eQTLs.
Genome-wide association studies (GWAS) of autoimmune and allergic diseases have implicated non-coding variants that may regulate T cell gene expression (1–5). However, studies measuring the effect of these variants on bulk gene expression (expression quantitative trait loci, eQTLs) have incompletely explained their pathogenicity (6). Bulk assays obscure heterogeneity that is essential for effective T cell function and often require non-physiologic ex vivo stimulation. Single-cell assays, on the other hand, capture fine-grained physiologic T cell states defined by discrete surface markers (CD4+, CD8+), cytokines (TH1, TH2, TH17), transcription factors (T-bet, RORγt), or transcriptomic programs with varying degrees of expression (effector, cytotoxicity, activation). These states are neither static nor mutually exclusive: they may coexist in the same cell (for example, more effector-like CD4+ TH2 cells, seen in asthma) or they may transition (for example, TH17 cells gradually become IFNγ- and IL-17-coproducing TH17/1 cells seen in tuberculosis antigen-specific cells) (7–10). Certain states are effective therapeutic targets, like TH2s in allergy and TH17s in psoriasis (11, 12).
A T cell’s states may determine the magnitude or presence of eQTLs in that cell. For example, ex vivo activation alters variants’ regulatory effects (13, 14). However, most recent single-cell eQTL studies are unable to achieve this resolution because they identify state-dependent effects by first aggregating cells from each discrete cluster or other phenotypic classification to reduce dimensionality and mitigate sparsity and then using linear models (15–18). This limits the scope of analysis to coarse states that may imperfectly capture T cell biology or arbitrarily partition a continuous transcriptional landscape, such as single-cell differentiation trajectories or functional gradients like cytotoxicity.
In order to understand how regulatory genetic variants interact with the dynamic range of in vivo T cell states essential for disease pathogenesis, here we instead leverage multidimensional cell-state heterogeneity captured in multimodal single-cell assays of resting memory T cells. By considering each cell’s position along multiple continuous functional axes, we can dissect state-dependent eQTL effects at single-cell resolution and better identify disease-relevant regulatory heterogeneity.
Results
Memory T cell eQTLs in Peruvians include shared and ancestry-specific regulatory variation
For this study, we used single-cell expression of the transcriptome and 30 surface proteins from our previous CITE-seq study of memory T cells isolated from 259 healthy Peruvian individuals with prior resolved Mycobacterium tuberculosis infection (10). We chose the surface proteins because of their role in T cell function. After applying quality control (QC) as shown previously, 500,089 cells remained for eQTL analysis, with an average of 4,927 unique molecular identifiers (UMIs) and 1,475 genes per cell (Methods, Fig. S1A-B) (10). We analyzed 5,460,354 genotyped and imputed variants passing QC.
We first defined a core set of eQTLs across all memory T cells, agnostic to state; eQTLs demonstrating a robust main effect would be promising candidates to later test for state-dependent effects in a single-cell model (Fig. 1A). Since we were not yet considering individual cells’ states, pseudobulk analysis was sufficient. We summed the expression of each gene across all cells from each donor (mean = 1,851 cells/donor, Fig. S1C) and treated this expression profile as a bulk sample for sample-level normalization, gene QC, and correction of measured and latent covariates. Next, to define cis-eQTL effects, we tested associations between the covariate-corrected expression of 15,789 genes expressed in >50% of samples and all variants up to 1 Mb from each gene’s transcription start site (TSS).
We found 6,511 eGenes with significant cis-eQTLs (q value < 0.05), consistent with previous bulk eQTL studies with similar sample size (19, 20). These genes included previously described eGenes such as CTLA4 and ERAP2 (Fig. 1B-C, Table S1) (20, 21). We also found 808 eQTLs that were driven by genetic variation common in the Peruvian population but rare or absent in Europeans (Table S2). For example, an eQTL for MAF (β = 0.32, p = 3.45×10⁻⁷) was driven by rs9927852 (chr16_78894778_T_C: minor allele frequency = 22% in study cohort, 27% in 1000 Genomes Peruvians in Lima, Peru [PEL], 1% in European [EUR], Fig. 1D-E) (22). When we conditioned on the lead eQTL for each eGene (n = 6,511), we observed exactly two independent effects at 418 loci, such as MDGA1, and more than two independent effects at 18 loci upon repeated conditional analysis (Fig. 1F-G, Table S3).
To determine if results were consistent with previously published T cell eQTLs, we compared the lead variants’ effects to bulk naive CD4+ T cell eQTLs from individuals of European ancestry (n = 169) reported by the BLUEPRINT project (19). eQTL dynamics described in prior studies suggest that naive and memory T cells share many bulk eQTL effects (2, 21). Despite differences in linkage disequilibrium due to ancestry, technology, and cell type, we observed that the eQTLs from our analysis were largely significant in BLUEPRINT (at q < 0.05, 2,056 significant in both/3,249 significant in current study and measured in BLUEPRINT), and of those that were significant in both datasets, most had concordant directions of effect (1,917/2,056 = 93% same direction, Fig. S2).
Continuous cell states capture functionally distinct dimensions of T cell heterogeneity
Combined single-cell mRNA and protein measurements from CITE-seq allow us to define cell states in conventional ways, such as clustering or gating on protein markers as in flow cytometry (e.g. CD4+ T cells). However, to better capture the continuous heterogeneity of T cell states, we projected cells into a multimodal low-dimensional embedding with canonical correlation analysis (CCA), as previously described (Methods) (10). Each cell received a score along 20 dimensions (canonical variates, CV) defined by orthogonal variation shared between mRNA and surface protein expression, because cross-modality signal is likely to reflect more robust cell states (Fig. S3A). This was demonstrated when we clustered on these CVs in the original study and identified 31 canonical memory T cell states including regulatory, type 1/2/17 helper, and gamma delta T cells, many of which could not be precisely defined in unimodal analysis of mRNA alone (Fig. S3B) (10).
Rather than using CVs to partition cells into clusters, as in the original study, we now used the top CVs as continuous representations of cell state (Fig. 1A). We expected that each CV might represent a distinct, biologically relevant function because clusters delineated by the CVs correspond to known T cell states (Fig. S3B). Indeed, we observed that individual CVs correlate with genes, proteins, and gene sets relevant to well-described T cell functions (Fig. 2A). For example, CV1 correlated with a previously defined cytotoxicity (“innateness”) gene set and CV2 correlated with a regulatory T cell (Treg) gene set (Fig. 2B-C) (23, 24). We confirmed both correlations with gene set enrichment analysis (Table S4). In some instances, published gene sets weren’t significantly enriched, but CVs correlated with marker genes of known memory T cell states. For example, CV4 correlated with TH2 marker GATA3 (Pearson r = 0.23 in non-zero cells, p < 10⁻¹⁷⁸⁵), and CV8 correlated with gamma delta T cell marker TRDC (Pearson r = 0.51 in non-zero cells, p < 10⁻⁷⁶⁷; Fig. 2A, Fig. S3C-F) (25).
Single-cell analyses typically use multiple components of a low-dimensional embedding to define higher-resolution cell states (often clusters) carrying out combinations of functional programs (26, 27). Accordingly, the average CV scores of T cells in each CCA-defined cluster varied (Fig. 2C, Table S5) and can be used to interpret the functional diversity among clusters, e.g., some clusters have more effector function (high CV1), others are more TH1-like (high CV6). However, clusters obscure heterogeneity that exists between individual cells from the same cluster (Fig. 2C). Moreover, continuous metrics like CVs don’t just classify cells into states, but instead capture the degree to which each state influences a cell, which is a more faithful representation of how activation or helper states manifest in T cells (8, 28). Thus, using the CV scores themselves, or similar metrics defined at single-cell resolution, may be more precise.
Single-cell statistical models define cell-state-dependent eQTLs
Modeling sparse expression and cell states at single-cell resolution requires statistical models that differ from those commonly used for bulk or pseudobulk eQTL analysis. Here, we used single-cell Poisson mixed-effects (PME) regression, which can model discrete and continuous cell states, Poisson-distributed scRNA-seq UMI counts, and the nested structure of cells within donors and batches (29, 30). We model the UMI counts of each gene in single cells as a function of genotype, correcting for potentially confounding fixed-effect covariates (age, sex, genotype PCs, gene expression PCs) and random-effect covariates (donor, batch) (Fig. 3A, Methods).
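As a rough illustration of the fixed-effects part of such a model, the Python sketch below computes the log-scale linear predictor for one cell and the Poisson log-likelihood of observed UMI counts. It is a simplified sketch, not the implementation used in this study: the donor and batch random effects are omitted, and the sequencing-depth offset (`log_total_umi`), the function names, and the argument layout are assumptions introduced here.

```python
import math


def linear_predictor(genotype, covariates, beta_g, beta_cov, intercept, log_total_umi):
    """Log of the expected UMI count for one cell (fixed effects only).

    Assumption: log(total UMIs in the cell) enters as an offset. The donor and
    batch random intercepts of the full mixed model are left out of this sketch.
    """
    eta = intercept + beta_g * genotype + log_total_umi
    eta += sum(b * x for b, x in zip(beta_cov, covariates))
    return eta


def poisson_log_likelihood(counts, etas):
    """Sum of Poisson log-probabilities log P(y | mu = exp(eta)) over cells."""
    return sum(
        y * eta - math.exp(eta) - math.lgamma(y + 1)
        for y, eta in zip(counts, etas)
    )
```

In a full mixed-effects fit, random intercepts for donor and batch would be added to the linear predictor and the likelihood would be integrated over them, which is typically delegated to dedicated mixed-model software.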
To demonstrate consistency with commonly used linear models, we used the PME model to reanalyze our data and successfully recapitulated almost all eQTLs detected in pseudobulk analysis with nominal significance (6,291/6,511 = 97%) and concordant direction of effect (6,509/6,511 > 99.9%, Fig. S4A, Table S6). We permuted genotypes and observed that in these null data, 5.3% (347/6,511) of the eGenes were significant at p < 0.05, demonstrating well-calibrated type 1 error (Fig. S4B). Although donors were part of a former TB progression cohort, 6,510/6,511 eQTLs had no significant differences (q < 0.05) between people with and without a history of disease progression (Table S7).
Then, to identify eQTLs with cell-state-dependent effects, we added an interaction term between genotype and cell state (capturing heterogeneity in the eQTL effect) to the PME model. We compared this model to a baseline model controlling for the genotype (overall eQTL effect) and cell state (differential expression) to assess significance (Methods, Fig. 2A). Although this model can accommodate continuous states, in order to compare it to conventional pseudobulk models, we first chose a simple binary test case: CD4+ vs. CD4-, based on surface protein expression measured with CITE-seq (Fig. S5A-D). The total eQTL effect estimated in CD4+ cells (βtotal = βG + βGxCD4) with the PME interaction model was consistent with eQTL analysis with a pseudobulk linear model or a single-cell PME model of CD4+ T cells gated from the CITE-seq dataset (Fig. S5E-F, Table S8-10). Furthermore, using genotype permutations, we demonstrated that type I error for the interaction term was well-controlled at α=0.05 (397/6511 = 0.061, Fig. S5G).
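One way to write this comparison is given below, using notation introduced here for illustration: cell i comes from donor d(i) in batch b(i), g is the genotype dosage, s is the cell-state value (1 or 0 for CD4+ vs. CD4-, or a continuous score), x is the vector of fixed-effect covariates, and u and v are donor and batch random intercepts. The symbols are assumptions of this sketch rather than the notation of the Methods.

```latex
\begin{align}
\text{baseline:}\quad
  \log \mathbb{E}[y_i] &= \beta_0 + \beta_G\, g_{d(i)} + \beta_S\, s_i
    + \mathbf{x}_i^{\top}\boldsymbol{\beta}_X + u_{d(i)} + v_{b(i)} \\
\text{interaction:}\quad
  \log \mathbb{E}[y_i] &= \beta_0 + \beta_G\, g_{d(i)} + \beta_S\, s_i
    + \beta_{G\times S}\, g_{d(i)}\, s_i
    + \mathbf{x}_i^{\top}\boldsymbol{\beta}_X + u_{d(i)} + v_{b(i)} \\
\beta_{\text{total}}(s) &= \beta_G + \beta_{G\times S}\, s
\end{align}
```

With a binary state, setting s = 1 gives the total effect in CD4+ cells, βtotal = βG + βGxCD4, and s = 0 leaves βG as the effect in CD4- cells.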
An alternative is to apply a linear mixed-effects (LME) model to normalized single-cell expression data. Without considering cell state, LME performed similarly to PME (Fig. S4C-E, Table S11). However, when we added an interaction term, the LME model was not robust to differential expression between cell states. Even when the only difference between an eQTL’s effect in CD4+ and CD4- cells was due to artificially simulated differential expression, the LME model spuriously detected highly significant state-specific eQTLs, while the PME model did not (Methods, Fig. S6). This is consistent with previous studies showing that LME models inadequately describe the distribution of single-cell expression data (29, 30).
eQTL effects vary systematically along continuous single-cell states
With the single-cell resolution of the PME model, we were able to demonstrate how eQTLs varied across continuous T cell states. We represented cell states with cells’ projections on the individual CVs defined in the original study (Fig. 3A, Fig. S3A) (10). We found that a large proportion of eQTLs are modified by these cell states. Focusing on CV1, which captured cytotoxic function, we observed that 1,097 of 6,511 memory T cell eQTLs had a significant interaction (q < 0.05, Table S12), i.e., the magnitude of the eQTL effect varies in cells depending on their CV1 score. For example, the rs9927852 eQTL for MAF had an interaction effect that amplified the eQTL in cells with higher CV1 scores (βG = 0.098, βGxCV1 = 0.13). This means that the eQTL has almost no effect in cells in the lower third of CV1 scores but is strongest in cells in the upper third of CV1 scores (average βtotal in lower third = 0.005, average βtotal in upper third = 0.24, Fig. 3B). Interaction effects were independent from differential expression and main genotype effects, and the type 1 error was well-controlled upon permutation of CV1 scores (Fig. S7).
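The arithmetic behind this example can be sketched in a few lines of Python. The two coefficients are the estimates quoted above for the MAF eQTL; the function names are hypothetical, and the sketch assumes that the total effect is linear in the CV1 score on the model's log scale.

```python
def total_eqtl_effect(beta_g, beta_gx_state, state_score):
    """Genotype effect (log scale) for a cell with the given state score."""
    return beta_g + beta_gx_state * state_score


def score_for_effect(beta_g, beta_gx_state, beta_total):
    """State score at which the total effect equals beta_total."""
    return (beta_total - beta_g) / beta_gx_state


# Estimates reported in the text for the rs9927852-MAF eQTL and CV1.
BETA_G, BETA_GXCV1 = 0.098, 0.13

# At a CV1 score of zero the total effect is just the main genotype effect.
effect_at_zero = total_eqtl_effect(BETA_G, BETA_GXCV1, 0.0)

# CV1 score at which the fitted eQTL effect would vanish (about -0.75).
null_score = score_for_effect(BETA_G, BETA_GXCV1, 0.0)
```

Under the same linearity assumption, the reported tertile averages of 0.005 and 0.24 would correspond to mean CV1 scores of roughly -0.72 and 1.09, respectively.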
We observed that continuous cell states captured more state-dependent regulatory variation than analogous discrete phenotypes. For example, CD4+ and CD8+ are two major discrete lineages of memory T cells, and continuous CV1 scores largely discriminate between them (classifying CD4+ based on CV1 < 0: sensitivity = 0.85, specificity = 0.93; Fig. S8A). A PME model of eQTL interactions with CV1 recapitulated 517/619 (84%) eQTLs identified in a PME model with the CD4+ state, as expected, but also identified an additional 580 eQTLs uniquely significant in the continuous analysis (Fig. 3C, Fig. S8B). These eQTL interactions’ directions of effect were consistent between CV1 and CD4+, but they were non-significant in the discrete CD4+ analysis (94% concordant effect direction). Similarly, CV2 correlates with Treg markers, and the 1,033 eQTLs with CV2 interactions included but exceeded the 289/388 (74%) eQTLs with significant Treg cluster interaction effects (Fig. 3C, Fig. S8C, Table S13). These correspondences were specific, i.e., the CV1 eQTL interactions were not concordant with Treg cluster interactions and CV2 eQTL interactions were not concordant with CD4+ gate interactions (Fig. S8D). This shows that the continuous states captured by decomposing single-cell data represent biological programs with regulatory significance.
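For readers who want to reproduce the kind of comparison between a continuous score and a discrete gate described above, the following Python sketch computes sensitivity and specificity for the rule "CV1 < 0 implies CD4+". The function name and the input layout (one score and one boolean label per cell) are assumptions made for illustration.

```python
def sensitivity_specificity(cv1_scores, is_cd4, threshold=0.0):
    """Classify a cell as CD4+ when its CV1 score is below the threshold.

    Returns (sensitivity, specificity) relative to the protein-gated labels.
    """
    tp = fn = tn = fp = 0
    for score, truth in zip(cv1_scores, is_cd4):
        predicted = score < threshold
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)
```

The remaining misclassified cells are the interesting ones biologically: CD4+ cells with high CV1 scores and CD8+ cells with low CV1 scores are exactly the cells that a discrete gate cannot distinguish from the rest of their lineage.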
Continuous cell states may also explain heterogeneous eQTL effects better than discrete states do. For example, the MAF eQTL interacts with both CD4+ status (βG = 0.33, βGxCD4 = -0.25, pGxCD4 = 4.69×10⁻⁸⁵) and CV1 (βG = 0.098, βGxCV1 = 0.13, pGxCV1 = 1.80×10⁻²⁴³) in a biologically concordant manner, i.e., CD4+ cells tend to have lower CV1 scores and both are associated with weaker MAF eQTL. However, the interaction with CD4+ status was no longer significant when we conditioned on CV1 interactions (p = 0.77), but the CV1 interaction maintained significance (p = 1.19×10⁻²³¹). Closer inspection of minor subsets of CD4+ T cells with higher CV1 scores (15%) and CD8+ T cells with lower CV1 scores (4.7%) confirms this. Both CD4+ and CD8+ memory T cells with high CV1 scores have strong eQTL effects (βCD4,high CV1 = 0.22, βCD8,high CV1 = 0.37, Fig. 3D), while cells of both lineages with low CV1 score have weaker effects (βCD4,low CV1 = 0.03, βCD8,low CV1 = 0.04, Fig. 3D). We observed that the CV1 interaction similarly superseded the discrete CD4+ interaction for 364/517 eQTLs with significant state dependence in both the discrete and continuous models, suggesting that the observed regulatory variation is more driven by a cell’s degree of cytotoxicity than by its lineage.
We hypothesized that multivariate modeling of multiple orthogonal CVs in an eQTL model could capture more granular, multifaceted states and their effects on gene regulation in individual cells. By sequentially adding CVs to a PME model and quantifying their cumulative significance, we observed the number of interacting eGenes reaches a maximum with 7 CVs in the model (2,237 interacting eGenes out of 6511, at LRT q<0.05, Fig. 4A, Table S14); adding more CVs does not substantially change the number of state-dependent eGenes (with 15 CVs: 2,221 interacting eGenes at q<0.05). CV interaction effects from the multivariate model were generally highly concordant with effects from univariate interaction models (r=0.87-0.97, Fig. S9A, Table S15-20), consistent with the independence of orthogonal CVs.
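The cumulative significance of several interaction terms can be assessed with a likelihood ratio test (LRT). The Python sketch below shows the generic calculation under the assumption that adding k CV interaction terms adds k parameters, so that the test statistic is compared to a chi-square distribution with k degrees of freedom. The function names are hypothetical and the log-likelihood values would come from fitted models, which are not shown.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_i (x/2)^(i - 1/2) / Gamma(i + 1/2)
    term = math.sqrt(half) / math.gamma(1.5)
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= half / (i - 0.5)
        total += term
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def lrt_pvalue(loglik_null, loglik_full, n_extra_params):
    """LRT statistic and p value for a full model nested over a null model."""
    stat = 2.0 * (loglik_full - loglik_null)
    return stat, chi2_sf(stat, n_extra_params)
```

For the 7-CV model, `n_extra_params` would be 7 under this assumption; the resulting p values across eGenes would then be converted to q values.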
CV1 had the most interacting eGenes in both the univariate and multivariate (7 CV) models (Fig. 4B, Fig. S9B). Some eGenes significantly interacted with multiple cell states (Fig. 4C), and some pairs of states had related directions of effect in the multivariate model: for example, CV1 (cytotoxicity) and CV6 (TH1) tended to have the same direction of effect, while CV1 and CV3 (central) tended to have opposite directions (Fig. 4D-F). By clustering genes based on their interaction z scores (relative to the direction of the main effect) for the 7 CVs in the multivariate model, we defined eight broad clusters of genes based on distinct patterns of CV interactions that may reflect shared cell-state-dependent regulatory mechanisms (Fig. S10, Table S21).
Individual cells have distinct eQTL effects
To estimate the eQTL effect for each gene at single-cell resolution, we can sum the products of interaction betas and corresponding CV scores for each individual cell (Fig. 5A, Methods). These CV scores capture the partial influence of each state that may be modulating regulatory activity. Adding this value to the baseline genotype beta estimates the total cell-level eQTL effect. These single-cell eQTL effects vary across cells and are independent of eGene expression.
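The calculation described above can be sketched as follows. This is a minimal Python illustration with a hypothetical function name; the coefficient and score values in the usage lines are invented for the example and are not estimates from this study.

```python
def single_cell_eqtl_effect(beta_g, interaction_betas, cv_scores):
    """Total eQTL effect for one cell.

    beta_g:            baseline genotype effect
    interaction_betas: one genotype-by-CV interaction beta per CV in the model
    cv_scores:         the cell's score on each of the same CVs
    """
    if len(interaction_betas) != len(cv_scores):
        raise ValueError("need one CV score per interaction beta")
    return beta_g + sum(b * s for b, s in zip(interaction_betas, cv_scores))


# Illustrative values only: two cells that differ in their CV scores.
betas = [0.13, -0.05, 0.02]
cell_a = single_cell_eqtl_effect(0.10, betas, [1.0, 0.5, -1.0])
cell_b = single_cell_eqtl_effect(0.10, betas, [-1.0, 2.0, 0.0])
```

Because each cell has its own vector of CV scores, two cells from the same donor, or even from the same cluster, can receive different estimated effects for the same variant.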
Independent variants acting on the same eGene may have different state dependencies
Previous studies suggest that secondary eQTLs identified after conditioning on the lead effect are more likely to be cell-state-specific (31). We indeed observed a significantly larger fraction of secondary eQTLs with significant cell state interactions compared to lead variants (Fig. 5B, Fisher p = 6.98×10⁻⁵²). Of the 436 secondary variants, 71% had cell-state-specific effects compared to 34% of lead variants. In total, 212 eGenes had at least two independent state-interacting effects. In some cases, eGenes’ lead and secondary variants may have contradictory interactions with the same CV. For example, the effect of MDGA1’s lead variant increases with CV1, while its secondary effect decreases (Fig. 5C-D). For GNLY, the effect of the lead variant increases with CV4, while its secondary effect decreases. Of the 64 eGenes with at least two independent effects interacting with CV1 (q < 0.05), 30 eGenes had different directions of CV1 interactions for their lead and secondary variants. Across all the CVs, 70 eGenes displayed this discordance with at least one state, demonstrating that cell states may not influence all the eQTLs for a gene in the same way (Fig. S11).
State-dependent eQTLs colocalize with autoimmune-associated variants
Consistent with previous studies, we found that the memory T cell eQTLs that we defined with pseudobulk analysis were enriched for variants in LD with genome-wide significant loci associated with immune traits compared to genome-wide significant loci for all other traits in the GWAS catalog. For example, we observed relative enrichment for rheumatoid arthritis (RA; OR = 4.67, Fisher p = 2.25×10⁻⁷) and inflammatory bowel disease (OR = 4.84, Fisher p = 2.16×10⁻¹¹, Fig. S12A, Table S22). We recapitulated previously described disease-associated eQTL variants, like rs1893592 (chr21_42434957_A_C), an eQTL for UBASH3A that is associated with RA (32).
We then assessed the cell-state dependence of disease-associated variants that overlapped eQTLs. Cell-state-interacting memory T cell eQTLs were enriched for overlap with GWAS variants compared to non-interacting eQTLs (OR = 1.31, Fisher p = 2.7×10⁻⁴), and state-dependent eQTL variants overlapped with at least one GWAS variant for 189/195 traits tested from the GWAS Catalog (33). State-interacting eQTLs were nominally enriched compared to non-interacting eQTLs for overlap with 14 individual traits, with associations that exceeded the null expectation (2,237/6,511 = 34%) for immune traits like RA (17 interacting, 7 non-interacting), type 1 diabetes (13 interacting, 7 non-interacting), and multiple sclerosis (24 interacting, 19 non-interacting) (ORs: 1.58-9.57; Fig. S12B, Table S23).
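To illustrate how such an enrichment can be computed, the Python sketch below builds a 2x2 table and evaluates an odds ratio and a one-sided Fisher exact p value. The table is one plausible construction from the RA counts quoted above (17 of 2,237 interacting and 7 of the remaining 4,274 non-interacting eQTLs overlapping RA variants); it is an assumption of this sketch, and the tables actually tested in the study may have been defined differently.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null with all margins fixed."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, col1)


# Rows: interacting / non-interacting eQTLs.
# Columns: overlapping an RA GWAS variant / not overlapping.
ra_table = (17, 2237 - 17, 7, (6511 - 2237) - 7)
ra_or = odds_ratio(*ra_table)
ra_p = fisher_one_sided(*ra_table)
```

Exact integer arithmetic via `math.comb` avoids the underflow that a naive floating-point hypergeometric calculation would hit for tables of this size.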
For example, the lead eQTL variant for ORMDL3 (rs4065275) was in LD (r² = 0.69 in 1KG PEL, r² = 0.68 in 1KG EUR) with an RA GWAS variant (rs59716545) and also had a significant cell state interaction across CVs 1-7, driven by significant interactions with CVs 1 and 2 (34). The ORMDL3 eQTL was strongest in the GZMB+ cytotoxic CD8+ T cells, intermediate in TH17s and other helper CD4+ states, and weakest in RORC+ Tregs (Fig. 6A). On the other hand, the lead IL18R1 eQTL variant (rs11123923, chr2_102351384_C_A), in LD with inflammatory bowel disease GWAS variant rs1420098 (r² = 1.00 in 1KG PEL and EUR), was strongest in TH2s and TH17s with weaker effects in cytotoxic states (Fig. 6B) (35).
GWAS variants did not always have stronger eQTL effects in states with higher overall expression. For example, the lead eQTL effect for CTLA4 was mediated by rs3087243 (chr2_203874196_G_A), which is associated with RA (36). Although CTLA4 expression is highest in a subset of Tregs, RORC+ Tregs, and activated CD4+ T cells, these cells had weaker eQTL effects (Fig. 6C). The eQTL effect was strongest in cytotoxic CD4+ T cells, a state with very low CTLA4 expression. Our results suggest that disease processes may, in fact, emerge in unlikely states when pathogenic variants modulate low-level gene expression.
State-dependent eQTLs are enriched in T cell regulatory regions
State-dependent eQTLs may be concentrated in particular regulatory regions, including promoters (whose effects are shared across states) or enhancers (which tend to have state-specific regulatory functions) (37). To test these regions for enrichment, we first defined promoters as the region within 2 kb of the transcription start site. We fine-mapped the eQTL effect at each locus with CAusal Variants Identification in Associated Regions (CAVIAR) based on summary statistics from the pseudobulk analysis (38). For loci where we were able to fine-map the lead effect to a single variant (n = 508, posterior inclusion probability [PIP] ≥ 0.5), we calculated a 12.07-fold enrichment of eQTL variants weighted by their PIPs in promoters (permutation p < 0.001, Fig. 6D, Methods). Cell-state-interacting and non-interacting eQTLs were both strongly enriched at 11.38- and 13.43-fold, respectively (p < 0.001, one-sided p for Δint-no int = 0.21), reflecting the regulatory importance of promoters regardless of state.
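A PIP-weighted fold enrichment of the kind reported above can be sketched as follows. The estimator shown here (the PIP-weighted fraction of fine-mapped variants inside the annotation, divided by the fraction of background variants inside it) is an assumption for illustration; the estimator and permutation scheme actually used are defined in the Methods, and the function name is hypothetical.

```python
def pip_weighted_fold_enrichment(pips, in_annotation, background_fraction):
    """Fold enrichment of fine-mapped variants in an annotation.

    pips:                posterior inclusion probability of each variant
    in_annotation:       True if the variant lies in the annotation (e.g. a promoter)
    background_fraction: fraction of background variants that lie in the annotation
    """
    weighted_hits = sum(p for p, hit in zip(pips, in_annotation) if hit)
    total_weight = sum(pips)
    return (weighted_hits / total_weight) / background_fraction
```

A permutation p value could then be obtained by recomputing the statistic on matched sets of background variants and counting how often the permuted value reaches the observed one.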
Next, we hypothesized that enhancers may also be enriched for cell-state specific eQTLs. Since there is uncertainty in the specific location of regulatory regions outside of the promoters, we defined cell-type-specific regulatory regions with Inference and Modeling of Phenotype-related ACtive Transcription (IMPACT), a logistic regression model trained on cell-type-specific transcription factor binding and epigenetic features (39). The IMPACT model for lineage-determining T-bet in CD4+ TH1 cells estimates the probability that the epigenetic landscape at any genomic position is favorable for transcription factor binding, where higher scores represent T-cell-specific regulatory regions. We only considered regions outside the previously defined promoters (TSS +/- 2kb) as T-cell-specific regulatory regions. We found that eQTL variants were enriched 3.12-fold (p < 0.001, Fig. 6D). Cell-state-interacting eQTLs were almost twice as enriched (3.71) as non-interacting eQTLs (2.03) in T-cell-specific regions (both p < 0.001, one-sided permutation p for Δint-no int < 0.001).
To more precisely identify causal variants based on effects shared across ancestries, we combined this dataset with European data from BLUEPRINT and conducted multi-ancestry fine-mapping of pseudobulk effects (40). We identified a single causal variant (PIP ≥ 0.5) explaining the lead effect for 1,247 eGenes also identified in the Peruvian analysis. As in the Peruvian analysis, these variants were enriched in promoters (13.94, p < 0.001) and T-cell-specific regulatory regions outside the promoter (2.94, p < 0.001), with greater enrichment of state-interacting eQTLs (3.74) in T-cell-specific regions compared to non-interacting eQTLs (2.06) (both p < 0.001, one-sided p for Δint-no int < 0.001; Fig. S13).
Previous studies have found that secondary eQTL variants are more likely to affect enhancers than promoters (33). Consistent with this, we found that relative to lead variants, secondary eQTL variants were less enriched in promoters (1.87, p = 0.12) and comparably enriched in T-cell-specific regulatory regions (3.02, p = 0.054) (Fig. 6E). However, only non-interacting secondary variants were enriched in promoters (2.79, p = 0.034), while cell-state-interacting secondary variants were significantly depleted (0.13, p = 0.018). There was no difference in enrichment in T-cell-specific regions between interacting (3.92, p = 0.008) and non-interacting variants (2.53, p = 0.056) (one-sided p for Δint-no int = 0.16). This suggests that secondary variants generally have more cell-type-specific regulatory roles, regardless of cell-state-dependence, but those found to be state-dependent in the PME model are especially depleted for shared effects.
Discussion
Large single-cell datasets from genotyped cohorts, some with multiple single-cell data modalities, are becoming more common and make it possible to investigate how cell states shape the complex relationship between genetic variation, gene expression, and disease. In this study, we underscore the potential of these data to reveal state-dependent regulatory heterogeneity, a potential that remains untapped when they are analyzed with traditional bulk methods, and the urgent need to refocus eQTL analyses at single-cell resolution.
Recognizing growing evidence that clusters obscure the rich functional diversity of T cells and other dynamic cell types such as stem cells, stromal cells, and neurons, here we leveraged the granularity of single-cell data to better define state-dependent eQTLs (41–43). A single-cell Poisson model is computationally expensive—as noted by the few previous studies that have had mixed success with similar approaches—but its advantages over more common alternatives were most clear when we assessed state dependence: pseudobulk linear models cannot accommodate cell states defined at single-cell resolution, and a single-cell linear model was confounded by differential expression between states (18, 44). PME’s flexibility and robustness are important assets for effective state-dependent eQTL analysis.
Modeling continuous cell states in the PME model explained more overlooked variation, for example in rarer states like cytotoxic CD4+ T cells, which have been traditionally aggregated with other CD4+ T cells despite bearing more regulatory similarity to CD8+ T cells. This highlights the limitations of traditional discrete T cell states, which ignore the continuous ranges of T cell functions like activation, cytotoxicity, or helper lineages. However, continuous cell states can be difficult to interpret biologically, especially as more dimensions are considered jointly. We used multimodal CCA of gene expression and surface proteins for more robust definition and easier interpretation of these states. For other data modalities or cell types, alternative integration strategies may be effective (45, 46). Single-cell trajectories may reveal eQTLs varying along a unidimensional axis, like a perturbation or differentiation (17).
With these strategies, we identified state-dependent effects in a substantial proportion of eQTLs and reconstructed their joint effects in individual cells. Single-cell-resolution eQTL betas estimate the effect of a genetic variant on gene expression in any cell with a given cell-state profile, revealing that genetic variants can have different effects even in pairs of cells that share aspects of their states (for example, in the same cluster). We can not only identify conventional clusters in which disease-associated variants have different effects—such as CD8+ GZMB+ T cells for RA-associated rs4065275 near ORMDL3—but unbiasedly disentangle specific continuous states driving the overall eQTL that may transcend clusters, such as cytotoxicity for rs4065275 and ORMDL3.
These T cell states may be driven by distinct regulatory architectures, including transcription factors, epigenetic profiles, or chromatin accessibility patterns. State-dependent eQTLs may be in genomic positions that are only involved in regulatory activity in certain states, and the loci with independent eQTLs that have opposing state-dependent effects suggest that the exact position or nature of a variant determines these regulatory interactions. This study offers a starting point to design studies that further probe single-cell regulatory heterogeneity. For example, integrating single-cell ATAC-seq with RNA-seq in eQTL studies may offer insight into the overlap between these variants and state-specific accessible chromatin, and incorporating interactions between states or with the abundance of a cell state may clarify how the immune milieu shapes eQTL effects. Single-cell-resolution eQTLs can introduce new paradigms of how genes are regulated across diverse cell states.
Materials and Methods
Single-cell RNA-seq and genotype data and quality control (QC)
We previously published a dataset of memory T cells from a 259-donor subset of a Peruvian tuberculosis disease progression cohort (128 former cases, 131 former latently infected controls; GEO: GSE158769) along with detailed sample processing methods (6, 42). Briefly, we negatively isolated memory T cells with a modified Pan T cell magnetic-activated cell sorting (MACS®) kit with anti-CD45RA biotin and followed an optimized version of the CITE-seq protocol with TotalSeq™-A (BioLegend) oligonucleotide-labeled antibodies for a panel of 31 surface proteins (43). We pooled cells into batches of six donors for 10x Genomics library preparation and sequenced on an Illumina HiSeq X. Reads were aligned to GRCh38 with Cell Ranger. After demultiplexing donors with Demuxlet, we removed cells that were labeled as doublets, had < 500 genes expressed or > 20% of unique molecular identifiers (UMIs) from mitochondrial genes, came from samples whose genotypes did not match genotypes called from single-cell data, or lacked surface markers of memory T cells (CD3 and CD45RO) (44).
A superset of 4,002 donors was genotyped in a separate genetic study on a custom Affymetrix array (LIMAArray) based on whole exome sequencing from 116 individuals with active TB from the same Peruvian cohort (dbGaP: phs002025). The design of this array has been described previously (45). We removed variants that were significantly associated with batch (p < 1×10⁻⁵), duplicated, or had low call rate, significant differences in the missingness rate between cases and controls (p < 10⁻⁵), or Hardy-Weinberg p value < 10⁻⁵ in controls.
We mapped variants to GRCh37/b37 and used SHAPEIT2 to pre-phase genotypes and IMPUTE2 to impute genotypes with 1000 Genomes Project Phase 3 as the reference panel (46, 47). After removing SNPs with an INFO score < .9, minor allele frequency < 0.05, or deletions, the remaining variants were converted to GRCh38 with liftOver.
After single-cell and genotype QC, we used 500,089 cells from 259 donors and 5,460,354 variants for eQTL analysis.
Single-cell data processing
mRNA and protein data were processed separately, as described (6). Briefly, we normalized the UMIs for each gene in each cell to log(counts per 10,000) and used centered-log-ratio normalization for each protein within each cell. Normalized mRNA and protein expression were scaled so that each feature had mean = 0, variance = 1 across all cells. After selecting the top 1,000 variable genes per donor and removing the mouse immunoglobulin G protein (control), we conducted principal component analysis of each modality with the irlba R package and corrected the top 20 PCs of each modality for donor and library preparation batch effects with Harmony (48).
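The normalization steps above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names and toy counts are hypothetical, and the use of log(1 + x) and of a pseudocount of 1 in the centered log-ratio are assumptions, since the text specifies only "log(counts per 10,000)" and "centered-log-ratio normalization".

```python
import math

def log_cp10k(counts):
    """Normalize one cell's UMI counts to log(1 + counts per 10,000).
    The +1 inside the log is an assumption to keep zero counts finite."""
    total = sum(counts)
    return [math.log1p(c / total * 1e4) for c in counts]

def clr(counts):
    """Centered log-ratio across the proteins of one cell.
    A pseudocount of 1 (via log1p) is an assumption."""
    logs = [math.log1p(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def scale_feature(values):
    """Scale one feature across cells to mean 0 and (sample) variance 1."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in values]

# Toy data: three cells by three genes, and one cell's protein counts.
cells_mrna = [[3, 0, 7], [1, 2, 0], [0, 5, 5]]
normalized = [log_cp10k(cell) for cell in cells_mrna]
gene0_scaled = scale_feature([cell[0] for cell in normalized])
protein_clr = clr([10, 200, 35])
```

Scaling is applied per feature across cells, whereas both normalizations are applied within each cell, which mirrors the order described in the text.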
Pseudobulk eQTL analysis
To make pseudobulk expression profiles for 259 donors, we removed cells from 12 technical replicate samples and summed the UMI counts for each gene across all cells from each donor, producing one aggregated expression value for each gene in each donor. For CD4+ and CD8+ pseudobulk analysis, we constructed pseudobulk expression profiles for each donor in each compartment by in-silico gating cells that were CD4+CD8- or CD8+CD4-, respectively, based on their normalized surface protein expression measured in CITE-seq. Gates were defined through visual inspection. For Treg pseudobulk analysis, we used previously defined cluster annotations to construct a pseudobulk expression profile for the cells in clusters C-5 (RORC+ Treg) and C-9 (Treg) (6).
Genes were removed if expressed in fewer than half the donors (pseudobulk counts > 0 in ≤ 129 donors). For the remaining 15,789 genes, we normalized the pseudobulk profiles to log2(counts per million + 1) and applied inverse normal transformation. Then, we used probabilistic estimation of expression residuals (PEER) implemented in R to regress out age, sex, five genotype PCs, and 45 PEER factors (49).
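A minimal sketch of the pseudobulk construction and transformation follows. It is illustrative only: the helper names and toy inputs are hypothetical, and the rank offset (rank - 0.5)/n and the handling of ties in the inverse normal transformation are assumptions that the text does not specify. PEER correction is not shown.

```python
import math
from statistics import NormalDist

def pseudobulk(cell_counts, cell_donor):
    """Sum per-cell UMI count vectors into one vector per donor."""
    totals = {}
    for counts, donor in zip(cell_counts, cell_donor):
        acc = totals.setdefault(donor, [0] * len(counts))
        for g, c in enumerate(counts):
            acc[g] += c
    return totals

def keep_gene(donor_values):
    """Keep a gene only if its pseudobulk count is > 0 in more than half of donors."""
    expressed = sum(1 for v in donor_values if v > 0)
    return expressed > len(donor_values) / 2

def log2_cpm(counts):
    """log2(counts per million + 1) for one donor's pseudobulk profile."""
    total = sum(counts)
    return [math.log2(c / total * 1e6 + 1) for c in counts]

def inverse_normal(values):
    """Rank-based inverse normal transform of one gene across donors.
    Ties are broken by input order (an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out

# Toy data: four cells, two genes, two donors.
pb = pseudobulk([[1, 0], [2, 3], [0, 4], [5, 1]], ["A", "A", "B", "B"])
int_vals = inverse_normal([0.2, 5.0, 1.1])
```

With 259 donors, keep_gene retains a gene only when it is expressed in at least 130 donors, matching the stated removal of genes with counts > 0 in ≤ 129 donors.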
We then conducted a genome-wide cis-eQTL analysis across all 22 autosomes. For each gene, we associated its residual expression after PEER normalization with the dosage at each SNP within 1 Mb of the transcription start site. These models were implemented in FastQTL (default settings) (50). To ensure robustness, we used FastQTL’s beta approximation to compute a permutation p value from 1,000 permutations. To correct for multiple hypothesis testing, we calculated q values for the lead SNP for each gene and identified eGenes with significant eQTL variants at q < 0.05 (51).
For conditional pseudobulk analysis, we iteratively regressed each eGene’s PEER-normalized residuals on the dosage of its lead SNP and used the subsequent values for FastQTL analysis. We repeated this twice.
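One conditioning step can be sketched as an ordinary least-squares residualization. This is a simplified illustration with hypothetical names and toy values; it assumes a single-predictor regression with an intercept and does not reproduce the FastQTL runs themselves.

```python
def residualize(expression, dosage):
    """Return residuals of the regression expression ~ 1 + dosage (OLS)."""
    n = len(expression)
    mean_x = sum(dosage) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(dosage, expression)]

# Toy data: PEER-normalized residuals for one eGene and lead-SNP dosages in six donors.
lead_dosage = [0.0, 1.0, 2.0, 1.0, 0.0, 2.0]
egene_resid = [-0.9, 0.1, 1.2, -0.2, -1.1, 0.9]
conditioned = residualize(egene_resid, lead_dosage)
```

The conditioned values would then be passed back to the eQTL scan to look for a secondary signal, and the step repeated with the new lead SNP.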
Comparison to the BLUEPRINT project
To validate this model’s ability to detect previously characterized eQTLs, we used the naive CD4+ T cell eQTLs reported by the BLUEPRINT project (15). We selected eGenes that were significant in our dataset (n = 6,511) and identified the subset of eGene/lead variant pairs also measured by BLUEPRINT (n = 3,249) and significant in both datasets at q < 0.05 (n = 2,056). We compared the direction of effect for these variants between the two datasets.
Continuous cell state definition and annotation
For multimodal dimensionality reduction, we used canonical correlation analysis, as implemented in the cc function from the CCA R package (52). We ran CCA on scaled mRNA expression for the most variable genes (excluding T cell receptor genes) and scaled protein expression for all 30 memory T cell proteins, and computed cells’ scores on each canonical variate (CV) based on the weight of each gene on each CV. We then corrected scores on the top 20 CVs for donor and batch effects. For visualization, we projected this embedding into a two-dimensional Uniform Manifold Approximation and Projection (UMAP) with the umap function from the uwot R package (53).
To annotate each CV based on its biological correlates, we first measured the Pearson correlation coefficient between cells’ CV scores and the normalized expression of each surface protein marker we measured. We also measured correlations between CV scores and the normalized expression of lineage-defining genes.
We also conducted gene set enrichment analysis. First, we measured the correlation of each CV prior to batch correction with the expression of each gene used as input for CCA. These correlations defined the ranked gene list for each CV. Then, we measured the enrichment of each immunologic gene set (C7, only those annotated as “UP”) in MSigDB and a published “innateness” gene list in each CV’s ranked gene list with the fgsea function in the fgsea R package (54, 55). We corrected for multiple hypothesis testing with a Bonferroni p value threshold adjusted for 2,360 gene sets tested (0.05/2,360 = 2.12 × 10⁻⁵).
To compare CV scores between and within clusters, we used the cluster annotations defined in the previous study and computed the average score on each CV for cells from each cluster. We also randomly selected two cells each from clusters C-3 and C-15 to compare to each other and the cluster.
Single-cell eQTL modeling
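We modeled single-cell eQTLs with a Poisson mixed effects (PME) model of each gene’s UMI counts as a function of genotype at the eQTL variant and other donor- and cell-level covariates. For each gene:
$$\log(E_i) = \theta + \beta_{G} X_{d,\mathrm{geno}} + \beta_{\mathrm{age}} X_{d,\mathrm{age}} + \beta_{\mathrm{sex}} X_{d,\mathrm{sex}} + \beta_{\mathrm{nUMI}} X_{i,\mathrm{nUMI}} + \beta_{\mathrm{MT}} X_{i,\mathrm{MT}} + \sum_{k=1}^{5} \beta_{\mathrm{gPC}_k} X_{d,\mathrm{gPC}_k} + \sum_{k=1}^{5} \beta_{\mathrm{ePC}_k} X_{i,\mathrm{ePC}_k} + (\phi_d \mid d) + (\kappa_b \mid b)$$
where E is the expression of the gene in cell i, θ is an intercept, and all other βs represent fixed effects as indicated (G = genotype, nUMI = number of UMIs, MT = proportion of mitochondrial UMIs, gPC = genotype PC, ePC = single-cell mRNA expression PC) for covariates in cell i, donor d, or batch b. Donor and batch are modeled as random effect intercepts, written here as (φ_d | d) and (κ_b | b).
To test interactions with cell state, we added a fixed effect for cell state and a cell state × genotype interaction term:
$$\log(E_i) = \theta + \beta_{G} X_{d,\mathrm{geno}} + \beta_{\mathrm{state}} X_{i,\mathrm{state}} + \beta_{G \times \mathrm{state}} X_{d,\mathrm{geno}} X_{i,\mathrm{state}} + \beta_{\mathrm{age}} X_{d,\mathrm{age}} + \beta_{\mathrm{sex}} X_{d,\mathrm{sex}} + \beta_{\mathrm{nUMI}} X_{i,\mathrm{nUMI}} + \beta_{\mathrm{MT}} X_{i,\mathrm{MT}} + \sum_{k=1}^{5} \beta_{\mathrm{gPC}_k} X_{d,\mathrm{gPC}_k} + \sum_{k=1}^{5} \beta_{\mathrm{ePC}_k} X_{i,\mathrm{ePC}_k} + (\phi_d \mid d) + (\kappa_b \mid b)$$
When testing whether the discrete state (CD4+) or the continuous state (CV1) captured more variance, we included cell state and cell state interaction terms for both and removed each interaction term to create the corresponding null model for the likelihood ratio test (described below).
To test interactions with former TB progression status, we used the same model but with a fixed effect and genotype-interaction term for TB status.
To test interactions with multiple state-defining covariates (e.g., multiple CVs), we included additive fixed effect and interaction terms for each of the K CVs:
$$\log(E_i) = \theta + \beta_{G} X_{d,\mathrm{geno}} + \sum_{k=1}^{K} \beta_{\mathrm{CV}_k} X_{i,\mathrm{CV}_k} + \sum_{k=1}^{K} \beta_{G \times \mathrm{CV}_k} X_{d,\mathrm{geno}} X_{i,\mathrm{CV}_k} + \beta_{\mathrm{age}} X_{d,\mathrm{age}} + \beta_{\mathrm{sex}} X_{d,\mathrm{sex}} + \beta_{\mathrm{nUMI}} X_{i,\mathrm{nUMI}} + \beta_{\mathrm{MT}} X_{i,\mathrm{MT}} + \sum_{k=1}^{5} \beta_{\mathrm{gPC}_k} X_{d,\mathrm{gPC}_k} + \sum_{k=1}^{5} \beta_{\mathrm{ePC}_k} X_{i,\mathrm{ePC}_k} + (\phi_d \mid d) + (\kappa_b \mid b)$$
To test eQTLs within discrete cell states (e.g., CD4+), we subsetted the full dataset to cells in the state of interest (using gates or clusters). Then, we ran the Poisson single-cell model without any cell state terms.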
We fit all single-cell Poisson mixed models with the glmer function in the lme4 R package, with family = "poisson", nAGQ = 1, and control = glmerControl(optimizer = "nloptwrap") (56). To determine the significance of this model, we used a likelihood ratio test comparing the models with and without the genotype term (for the memory T cell analysis) or the cell state interaction term(s) (for the cell-state-specific analyses) and calculated a p value for the test statistic against the chi-squared distribution with one degree of freedom. We corrected for multiple hypothesis testing by calculating q values across all tested eQTLs.
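The likelihood ratio test described above can be sketched as follows, assuming the log-likelihoods of the full and null models have already been obtained from the fitted models. The function name and the example log-likelihood values are hypothetical. The sketch uses the identity that, for one degree of freedom, the chi-squared upper tail probability equals erfc(sqrt(statistic / 2)).

```python
import math

def lrt_pvalue(loglik_full, loglik_null):
    """Likelihood ratio test against a chi-squared distribution with 1 df.

    Returns (statistic, p value). The statistic is floored at 0 to guard
    against tiny negative values from optimizer noise (an assumption).
    """
    stat = max(2.0 * (loglik_full - loglik_null), 0.0)
    # For 1 df: P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods for a model with and without the tested term.
stat, p = lrt_pvalue(-100.0, -101.9207)
```

A statistic near 3.84 corresponds to p ≈ 0.05, the familiar one-degree-of-freedom critical value.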
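For comparison, we also used a single-cell linear mixed effects (LME) model to test eQTLs across all memory T cells and for state dependence with CD4+ state or continuous CV1. The model included the same covariates as the Poisson models. For example, across all memory T cells:
$$G_i = \theta + \beta_{G} X_{d,\mathrm{geno}} + \beta_{\mathrm{age}} X_{d,\mathrm{age}} + \beta_{\mathrm{sex}} X_{d,\mathrm{sex}} + \beta_{\mathrm{nUMI}} X_{i,\mathrm{nUMI}} + \beta_{\mathrm{MT}} X_{i,\mathrm{MT}} + \sum_{k=1}^{5} \beta_{\mathrm{gPC}_k} X_{d,\mathrm{gPC}_k} + \sum_{k=1}^{5} \beta_{\mathrm{ePC}_k} X_{i,\mathrm{ePC}_k} + (\phi_d \mid d) + (\kappa_b \mid b) + \varepsilon_i$$
where G is the log2(counts per 10K) normalized expression of the gene in cell i and ε is the residual error. We fit all single-cell linear mixed models with the lmer function in the lme4 R package, with REML = F (56), and determined the significance of the model as described for Poisson models.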
Type 1 error estimation
To estimate the false positive rate for the single-cell PME model of memory T cell eQTLs, we permuted genotype across donors and ran the PME model for each gene (n = 6,511). Then, we used a likelihood ratio test to compute a p value for each gene under the permutation and measured the proportion of genes with a p value < alpha = 0.05. To estimate the false positive rate for the single-cell PME model of cell state-dependent eQTLs, we used the same permutation approach but permuted cell state across cells to preserve the main genotype effect.
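The permutation scheme can be sketched as below. This is an illustration with hypothetical names and toy values: the same shuffling helper is applied to a donor-to-genotype map for the memory T cell analysis, or to a cell-to-state map for the state-dependence analysis. Refitting the PME model on the permuted data is not shown.

```python
import random

def permute_labels(label_by_unit, seed=0):
    """Shuffle labels across units (donors or cells), keeping the set of labels fixed."""
    rng = random.Random(seed)
    units = list(label_by_unit)
    labels = [label_by_unit[u] for u in units]
    rng.shuffle(labels)
    return dict(zip(units, labels))

def false_positive_rate(p_values, alpha=0.05):
    """Proportion of permuted-data p values below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)

# Toy data: genotype dosages for five donors.
donor_genotype = {"d1": 0, "d2": 1, "d3": 2, "d4": 1, "d5": 0}
permuted = permute_labels(donor_genotype, seed=1)
fpr = false_positive_rate([0.01, 0.20, 0.04, 0.90])
```

Permuting at the donor level breaks the genotype-expression link while preserving the donor structure of the cells; permuting cell state across cells leaves genotype untouched, so the main genotype effect is preserved.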
Simulating differential expression
We selected genes with non-significant cell state and cell-state interaction terms in the PME model with CD4+ as the cell state of interest. To test robustness of the single-cell models to differential expression, we uniformly reduced the expression level of each gene in each cell to 50%, 20%, and 10% of baseline expression. For PME, we did this in log2-space, converted back to counts, and rounded to the nearest whole number. For LME, we reduced expression in log2(counts per 10,000) space. Then, we ran the single-cell Poisson model with cell-state interaction (cell state = CD4+ cells).
Clustering eQTLs and cells
We stringently selected cell-state-dependent eGenes (LRT p value from modeling CVs 1-7 < 0.05/6,511 = 7.7 × 10⁻⁶). For each eGene, we extracted the z score for each cell state interaction term from the Poisson model and multiplied it by the sign of the main genotype beta to standardize directions of effect: a positive value means the interaction amplifies the baseline genotype effect, and a negative value means the interaction dampens it. Using Seurat, we built a shared nearest neighbor graph and used Louvain clustering with n.start = 20, n.iter = 20, and resolution = 1.5 to define eight clusters of eGenes (57).
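The sign standardization can be illustrated with a short sketch. The function name and input values are hypothetical, and the graph construction and Louvain clustering in Seurat are not shown.

```python
def standardize_interaction_z(main_beta, interaction_z):
    """Multiply each interaction z score by the sign of the main genotype beta.

    After this step, positive values mean the interaction amplifies the
    baseline genotype effect and negative values mean it dampens it.
    """
    sign = (main_beta > 0) - (main_beta < 0)
    return [sign * z for z in interaction_z]

# Hypothetical eGene with a negative main effect and two CV interaction z scores.
standardized = standardize_interaction_z(-0.3, [2.0, -1.5])
```

In this toy case the first interaction pushes the effect toward zero (dampening, so it becomes negative) and the second makes the negative effect stronger (amplifying, so it becomes positive).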
We explored the potential biological significance of the clusters by measuring the enrichment of MSigDB Hallmark and Gene Ontology gene sets. For each gene set, we used a Fisher’s exact test to compare the proportion of eGenes in the cluster that overlapped with the gene set versus the proportion in other clusters that overlapped with the gene set. We assessed significance with a Bonferroni threshold of 0.05/14,765 gene sets tested = 3.4 × 10⁻⁶.
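A minimal sketch of the enrichment test follows, written as a one-sided Fisher’s exact test on a 2 × 2 table (the one-sided form is taken from the caption of Table S21). The function name and the toy table are hypothetical; rows are eGenes in the cluster versus eGenes in other clusters, and columns are in the gene set versus not in it.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the
    count in the top-left cell given fixed row and column totals.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / denom

# Toy table: 3 of 4 cluster eGenes in the set, 1 of 4 other eGenes in the set.
p_enrich = fisher_greater(3, 1, 1, 3)
bonferroni = 0.05 / 14765
```

A gene set would be called enriched in a cluster only if its p value falls below the Bonferroni threshold.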
Cumulative eQTL interaction effect
The effect of each eQTL in each cell is the sum of the main genotype effect and all of the genotype × CV interactions. We calculated the overall effect for each eQTL in each cell by summing the genotype beta and the products of each CV score and the corresponding interaction beta:
$$\beta_{\mathrm{total},i} = \beta_{G} + \sum_{k=1}^{K} \beta_{G \times \mathrm{CV}_k} X_{i,\mathrm{CV}_k}$$
where X is the score of cell i on each of the K CVs in the model. For interpretation in some analyses, we defined cluster-level betas by averaging cell-level betas for all cells assigned to that cluster in the previous study.
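The cell-level and cluster-level calculations can be sketched as follows. The function names, coefficient values, and cluster labels are hypothetical and serve only to illustrate the arithmetic described above.

```python
def cell_level_beta(beta_g, beta_gxcv, cv_scores):
    """Main genotype beta plus the sum of interaction beta times CV score."""
    return beta_g + sum(b * s for b, s in zip(beta_gxcv, cv_scores))

def cluster_level_betas(cell_betas, cell_clusters):
    """Average the cell-level betas within each cluster."""
    sums, counts = {}, {}
    for beta, cluster in zip(cell_betas, cell_clusters):
        sums[cluster] = sums.get(cluster, 0.0) + beta
        counts[cluster] = counts.get(cluster, 0) + 1
    return {cluster: sums[cluster] / counts[cluster] for cluster in sums}

# Toy eQTL with two CV interactions, evaluated in three cells.
beta_g = 0.2
beta_gxcv = [0.1, -0.05]
cells_cv = [[1.0, 2.0], [0.0, 0.0], [2.0, 0.0]]
cell_betas = [cell_level_beta(beta_g, beta_gxcv, cv) for cv in cells_cv]
by_cluster = cluster_level_betas(cell_betas, ["C-1", "C-1", "C-2"])
```

A cell with CV scores of zero has exactly the main genotype effect, and the effect can change sign between cells when the interaction terms outweigh the main effect.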
GWAS variant enrichment
We downloaded the GWAS Catalog as of July 30, 2020, restricted to GWAS in European populations, identified variants associated with each of 194 traits at p < 5 × 10⁻⁸, and pruned with plink to remove variants with LD r² > 0.2 (29, 58). We also constructed a set of background variants by pooling variants associated with any trait at p < 5 × 10⁻⁸ and pruning with plink to remove variants with LD r² > 0.2.
For each memory-T cell eQTL, we identified all other variants with LD r² ≥ 0.5 in both the 1000 Genomes Peruvians in Lima, Peru (PEL) and European (EUR) populations. We then matched these variants with variants from the GWAS Catalog for 194 traits and the background variant set. We calculated all enrichments with a two-sided Fisher test. To calculate memory-T-cell eQTL enrichments for specific traits, we compared the proportion of eQTL-colocalizing GWAS variants for each trait with the proportion of eQTL-colocalizing background variants. To calculate the enrichment of GWAS variants colocalizing with state-dependent eQTLs, we compared the proportion of eQTLs with significant cell-state interaction (LRT q < 0.05 from the model with 7 CVs) that colocalize with the background variant set to the proportion of eQTLs without significant cell-state interaction that colocalize.
Fine-mapping memory-T cell eQTLs
For each locus (eGene), we used CAusal Variants Identification in Associated Regions (CAVIAR) software, allowing only a single causal variant in each locus (-c 1), to estimate the probability that each variant in a ±250 kb window around the transcription start site (TSS) is causal (34). We ran CAVIAR on pseudobulk eQTL z scores for these variants and pairwise Pearson correlation coefficients between the variants (calculated with plink v1.9b) (58).
For joint multi-ancestry analysis, we first lifted the BLUEPRINT dataset to GRCh38 with liftOver and filtered at a minor allele frequency threshold of 0.05. We then merged the TBRU and BLUEPRINT datasets matching on chromosome, position, reference, and alternate alleles and performed eQTL analysis on the joint datasets as described above. We fine-mapped each locus with CAVIAR using z scores from the joint analysis, as described in the Peruvian dataset.
Enrichment of eQTLs in regulatory regions
We defined promoters as the region ±2 kb from the transcription start site of each of the 6,511 significant eGenes based on the Cell Ranger 3.1.0 GTF. This annotation was binary.
We defined cell-state-specific regulatory regions with a probabilistic annotation of the genome by Inference and Modeling of Phenotype-related ACtive Transcription (IMPACT) (35). First, we collected public T-bet (TBX21) ChIP-seq data in CD4+ TH1 cells from NCBI as a gold standard for CD4+ TH1 regulatory elements (59). We also previously aggregated 5,345 public epigenetic features from NCBI, ENCODE, and Roadmap spanning all possible cell types (60). Then, we used IMPACT’s logistic regression model to distinguish 1,000 T-bet bound sequence motifs from 10,000 unbound T-bet sequence motifs genome-wide based on epigenetic feature characterization. We used HOMER (v4.8.3) to identify T-bet sequence motif matches as previously done (35, 61). We then characterized every nucleotide genome-wide using the same set of epigenetic features and estimated the probability (between 0 and 1) of a regulatory element important to the cell type.
For the multi-ancestry analyses, we restricted loci to those that were also significant eQTLs in the Peruvian-only eQTL analysis. We computed enrichments in each locus containing a variant with posterior inclusion probability ≥ 0.5 and averaged across loci. To compute enrichments for the binary promoter annotation, we determined whether each variant in the locus overlapped with a promoter region (X = 0 if no overlap, X = 1 if overlap). Then, we calculated the enrichment across n variants in the locus as: To compute enrichments for the probabilistic IMPACT T cell regulatory region annotations, we used the same strategy but X = the IMPACT score between 0 and 1 for each variant. We determined the significance of each enrichment by comparing the true enrichment score to a null distribution constructed by permuting the PIPs across variants in each locus 1,000 times and calculating an enrichment score. We determined the significance of the difference between the enrichments in interacting vs. non-interacting eGenes by comparing the true difference to a null distribution constructed by permuting interacting vs. non-interacting labels across eGenes and calculating an enrichment score. Both p values were computed with a one-sided comparison.
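The permutation-based significance test for a single locus can be sketched as below. The enrichment formula itself is not reproduced in this text, so the statistic used here (the PIP-weighted mean of X divided by the unweighted mean of X) is an assumption chosen only for illustration, as are the function names, the toy values, and the (hits + 1)/(permutations + 1) form of the one-sided p value.

```python
import random

def enrichment(pips, x):
    """ASSUMED statistic: PIP-weighted mean annotation over the unweighted mean."""
    weighted = sum(p * xi for p, xi in zip(pips, x)) / sum(pips)
    baseline = sum(x) / len(x)
    return weighted / baseline

def permutation_pvalue(pips, x, n_perm=1000, seed=0):
    """One-sided p value from permuting PIPs across the variants of a locus."""
    rng = random.Random(seed)
    observed = enrichment(pips, x)
    shuffled = list(pips)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point summation order does not drop ties.
        if enrichment(shuffled, x) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy locus: four variants; X is a binary promoter overlap (or an IMPACT score in [0, 1]).
pips = [0.90, 0.05, 0.03, 0.02]
annotation = [1, 0, 0, 0]
observed, p_perm = permutation_pvalue(pips, annotation)
```

The same permutation logic applies to the probabilistic annotation by passing IMPACT scores as the annotation values; averaging across loci and the label permutation for interacting versus non-interacting eGenes are not shown.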
Captions for Tables S1-23
Table S1. Pseudobulk memory T cell eQTLs in a Peruvian cohort. (See Nathan_etal_SuppTables.xlsx, tab 1) Lead eQTL variant for each eGene tested in a linear model of pseudobulk gene expression assayed in memory T cells from a Peruvian cohort. All SNP coordinates are from GRCh38. The model adjusts for donor’s age and sex, 5 genotype PCs, and 45 PEER factors. P values are from FastQTL beta approximation-based permutation.
Table S2. Ancestry-specific eQTL variants. (See Nathan_etal_SuppTables.xlsx, tab 2) eQTL variants in the Peruvian dataset driven by variants that are rare (MAF < 0.05) in 1KG EUR population.
Table S3. Conditional pseudobulk memory T cell eQTLs. (See Nathan_etal_SuppTables.xlsx, tab 3) Secondary eQTL variant for each eGene tested in a linear model of pseudobulk gene expression assayed in memory T cells from a Peruvian cohort, after regressing out the lead effect. The model adjusts for donor’s age and sex, 5 genotype PCs, and 45 PEER factors. P values are from FastQTL beta approximation-based permutation.
Table S4. Gene set enrichment for loadings on CVs 1-3. (See Nathan_etal_SuppTables.xlsx, tab 4) Top 10 gene sets enriched for CVs 1-3 based on genes’ loadings on each CV. P values and enrichment statistics are from the fgsea R package.
Table S5. Average CV scores by memory T cell cluster. (See Nathan_etal_SuppTables.xlsx, tab 5) Average score along each CV for cells in each cluster. Clusters were defined in Nathan, et al. by projecting cells into a low-dimensional embedding based on CCA of paired mRNA and surface protein, constructing a shared nearest neighbor graph, and conducting Louvain clustering at resolution = 2. Clusters were annotated based on differentially expressed genes and proteins.
Table S6. Single-cell Poisson model of memory T cell eQTLs. (See Nathan_etal_SuppTables.xlsx, tab 6) eQTL effects calculated with the PME model (without cell state interactions) for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the genotype term.
Table S7. Single-cell Poisson model of memory T cell eQTLs’ interaction with prior TB status. (See Nathan_etal_SuppTables.xlsx, tab 7) eQTL interactions with donors’ prior TB progression status calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the TB status term.
Table S8. Single-cell Poisson model of memory T cell eQTLs’ dependence on CD4+ state. (See Nathan_etal_SuppTables.xlsx, tab 8) eQTL interactions with cells’ CD4+ state calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. CD4+ cells were defined based on normalized surface protein expression measured in CITE-seq (CD4+CD8-). The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CD4+ state interaction term.
Table S9. Pseudobulk memory T cell eQTLs in CD4+ cells only. (See Nathan_etal_SuppTables.xlsx, tab 9) eQTLs in CD4+ memory T cells calculated with a pseudobulk linear model in FastQTL for significant eGenes and lead variants identified in the pseudobulk analysis. CD4+ cells were defined based on normalized surface protein expression measured in CITE-seq (CD4+CD8-). The model adjusts for donor’s age and sex, 5 genotype PCs, and 45 PEER factors and we calculated a nominal P value for the genotype effect.
Table S10. Single-cell Poisson model of memory T cell eQTLs in CD4+ cells only. (See Nathan_etal_SuppTables.xlsx, tab 10) eQTLs in CD4+ memory T cells calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. CD4+ cells were defined based on normalized surface protein expression measured in CITE-seq (CD4+CD8-). The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the genotype term.
Table S11. Single-cell linear model of memory T cell eQTLs. (See Nathan_etal_SuppTables.xlsx, tab 11) eQTL effects calculated with the LME model (without cell state interactions) for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the genotype term.
Table S12. Single-cell Poisson model of memory T cell eQTLs’ dependence on CV1. (See Nathan_etal_SuppTables.xlsx, tab 12) eQTL interactions with cells’ CV1 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV1 interaction term.
Table S13. Single-cell Poisson model of memory T cell eQTLs’ dependence on Treg state. (See Nathan_etal_SuppTables.xlsx, tab 13) eQTL interactions with Tregs calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The Tregs were identified through clustering in Nathan, et al. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the Treg interaction term.
Table S14. Single-cell Poisson model of memory T cell eQTLs’ dependence on CVs 1-7. (See Nathan_etal_SuppTables.xlsx, tab 14) eQTL interactions with CVs1-7 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV1-7 interaction terms.
Table S15. Single-cell Poisson model of memory T cell eQTLs’ dependence on CV2. (See Nathan_etal_SuppTables.xlsx, tab 15) eQTL interactions with cells’ CV2 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV2 interaction term.
Table S16. Single-cell Poisson model of memory T cell eQTLs’ dependence on CV3. (See Nathan_etal_SuppTables.xlsx, tab 16) eQTL interactions with cells’ CV3 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV3 interaction term.
Table S17. Single-cell Poisson model of memory T cell eQTLs’ dependence on CV4. (See Nathan_etal_SuppTables.xlsx, tab 17) eQTL interactions with cells’ CV4 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV4 interaction term.
Table S18. Single-cell Poisson model of memory T cell eQTLs’ dependence on CV5. (See Nathan_etal_SuppTables.xlsx, tab 18) eQTL interactions with cells’ CV5 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV5 interaction term.
Table S19. Single-cell Poisson model of memory T cell eQTLs’ dependence on CV6. (See Nathan_etal_SuppTables.xlsx, tab 19) eQTL interactions with cells’ CV6 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV6 interaction term.
Table S20. Single-cell Poisson model of memory T cell eQTLs’ dependence on CV7. (See Nathan_etal_SuppTables.xlsx, tab 20) eQTL interactions with cells’ CV7 calculated with the PME model for significant eGenes and lead variants identified in the pseudobulk analysis. The model adjusts for donor’s age and sex, percent MT UMIs and number of UMIs per cell, 5 genotype PCs, 5 expression PCs, and has random effects for donor and library preparation pool. P values are from an LRT comparing the model with and without the CV7 interaction term.
Table S21. GO Term enrichment in eGene clusters. (See Nathan_etal_SuppTables.xlsx, tab 21) Top 5 Gene Ontology term gene sets enriched for overlap with eGenes in each of the eight eGene clusters. These clusters were defined through Louvain clustering on the z scores for each eGene’s interactions with each of the 7 CVs, with signs corrected to be relative to the main genotype effect for that eGene. P values are from a one-sided Fisher test.
Table S22. eQTL enrichment among GWAS variants by trait. (See Nathan_etal_SuppTables.xlsx, tab 22) Enrichments of memory T cell eQTLs in variants associated with traits in the GWAS Catalog. Variants were considered to be overlapping if they had r² > 0.5 in both 1KG EUR and PEL populations. P values are from a Fisher test comparing the proportion of eQTLs in a trait’s GWAS variants to the proportion of eQTLs in GWAS variants for all other GWAS Catalog traits.
Table S23. State-dependent eQTL enrichment among GWAS variants by trait. (See Nathan_etal_SuppTables.xlsx, tab 23) Enrichments of state-dependent memory T cell eQTLs in variants associated with traits in the GWAS Catalog. Variants were considered to be overlapping if they had r² > 0.5 in both 1KG EUR and PEL populations. P values are from a Fisher test comparing the proportion of state-dependent eQTLs in a trait’s GWAS variants to the proportion of non-state-dependent eQTLs in GWAS variants for that trait.
Acknowledgment
Funding
National Institutes of Health grant U19AI111224 (SR, DBM, MM)
National Institutes of Health grant UH2AR067677 (SR)
National Institutes of Health grant T32HG002295 (AN)
National Institutes of Health grant T32AR007530 (AN)
National Institutes of Health grant U01HG009379 (SR)
National Institutes of Health grant R01AI049313 (DBM)
National Institutes of Health grant R01AR063759 (SR)
Author contributions
Conceptualization: AN, SR
Methodology: AN, SR, AP
Formal analysis: AN, SA, KI, TA, YL
Investigation: JIB, YB, SS
Resources: LL, MBM
Supervision: SR, DBM, MBM
Writing – original draft: AN, SR
Writing – review & editing: all authors
Competing interests
Authors declare that they have no competing interests.
Data and materials availability
All data are available in the main text or the supplementary materials. Code will be available on GitHub upon acceptance.