Abstract
Background The coronavirus disease 2019 (COVID-19) is an infectious disease that mainly affects the host respiratory system with ~80% asymptomatic or mild cases and ~5% severe cases. Recent genome-wide association studies (GWAS) have identified several genetic loci associated with the severe COVID-19 symptoms. Delineating the genetic variants and genes is important for better understanding its biological mechanisms.
Methods We implemented integrative approaches, including transcriptome-wide association studies (TWAS) and colocalization analysis, to interpret the genetic risks using two independent GWAS datasets in lung and immune cells. We further performed single cell transcriptomic analysis on a bronchoalveolar lavage fluid (BALF) dataset from moderate and severe COVID-19 patients.
Results We discovered and replicated the genetically regulated expression of CXCR6 and CCR9 genes. These two genes have a protective effect in the lung and a risk effect in whole blood, respectively. The colocalization analysis of GWAS and cis-expression quantitative trait loci highlighted the regulatory effect on CXCR6 expression in lung and immune cells. In the lung resident memory CD8+ T (TRM) cells, we found a 3.32-fold decrease in cell proportion and lower expression of CXCR6 in severe than in moderate patients using the BALF transcriptomic dataset.
Conclusion CXCR6 from the 3p21.31 locus is associated with severe COVID-19. CXCR6 tends to have a lower expression in lung TRM cells of severe patients, which aligns with the protective effect of CXCR6 from TWAS analysis. We illustrate one potential mechanism of host genetic factor impacting the severity of COVID-19 through regulating the expression of CXCR6 and TRM cell proportion. Our results shed light on potential therapeutic targets for severe COVID-19.
Introduction
The coronavirus disease 2019 (COVID-19) pandemic had infected over 100 million people and caused numerous morbidities and over 2 million deaths worldwide as of January 2021. The virus is evolving fast, with new variants emerging around the world [1, 2]. A huge disparity in the severity of symptoms among patients has been observed. Some patients show only mild or even no symptoms and require little treatment or intervention, while a subset of patients experience rapid disease progression to respiratory failure and need urgent and intensive care [3]. Although age and sex are major risk factors of COVID-19 disease severity [4], the factors underlying the variability in COVID-19 severity, and which groups of individuals carry intrinsic susceptibility to COVID-19, remain largely unclear.
Several genome-wide association studies (GWAS) have been carried out, and one genomic risk locus, 3p21.31, has been replicated as associated with critical illness. One recent study by the Severe COVID-19 GWAS Group identified the 3p21.31 risk locus for susceptibility to severe COVID-19 with respiratory failure [5]. This GWAS signal was then replicated in a separate meta-analysis comprising a total of 2,972 cases from 9 cohorts by the COVID-19 Host Genetics Initiative (HGI) round 4 alpha. However, there is a cluster of 6 genes (SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, and XCR1) near the lead SNP rs35081325 within a complex linkage disequilibrium (LD) structure, so the “causal” gene and the functional implication of this locus remain elusive [5, 6].
The majority of GWAS variants are located in non-coding loci, many of which are in enhancer or promoter regions, playing roles as cis- or trans-regulatory elements that alter gene expression [7]. Although the function of non-coding variants cannot be directly interpreted from their locations, their mediation effect on gene expression can be inferred by expression quantitative trait loci (eQTL) analysis. In recent years, large consortia like GTEx (Genotype-Tissue Expression), the eQTLGen Consortium, and DICE (database of immune cell expression) have generated rich eQTL resources in diverse tissues and immune-related cell types [7-9]. A variety of statistical approaches, such as transcriptome-wide association study (TWAS) analysis and colocalization analysis, have successfully identified the target genes of non-coding variants by integrating context-specific eQTLs [10-13].
Recent advances in single cell transcriptome sequencing provide unprecedented opportunities to understand the biological mechanism underlying disease pathogenesis at the single cell and cell type levels [14-16]. The recent generation of single cell RNA-sequencing (scRNA-seq) data from the bronchoalveolar lavage fluid (BALF) of moderate and severe COVID-19 patients has revealed the landscape of the gene expression changes in major immune cells. However, the transcriptome alteration in specific subpopulations remains mostly unexplored [17].
In this study, we aimed to connect the genetic factors with the context-specific molecular phenotype in COVID-19 patients. As illustrated in Figure 1, we designed a multi-level workflow to dissect the genetically regulated expression (GReX) that contributed to severe COVID-19. We performed TWAS and colocalization analyses with a broad collection of eQTL datasets at the tissue and cellular levels. We further integrated the BALF single cell transcriptome dataset to explore the cellular transcriptome alterations in severe and moderate COVID-19 patients. Lastly, we proposed a hypothetical mechanism, connecting our multi-layer evidence in host genetic factors, gene (CXCR6), and single cell transcriptome features with the severity of COVID-19.
Materials and methods
GWAS dataset
We obtained GWAS summary statistics for the phenotype “severe COVID-19 patients vs population” (severe COVID-19) from two separate meta-analyses carried out by the COVID-19 Host Genetics Initiative (HGI, https://www.covid19hg.org/) and the Severe COVID-19 GWAS Group (SCGG) [5]. The GWASHGI A2 round 4 (alpha) cohort consists of 12,816,037 SNPs from the association study of 2,972 very severe respiratory confirmed COVID-19 cases and 284,472 controls with unknown SARS-CoV-2 infection status from nine independent studies in a majority of the European Ancestry population. The GWASSCGG dataset is from the first GWAS of severe COVID-19 [5], including 8,431,427 SNPs from the association study conducted from 1,980 COVID-19 confirmed patients with severe disease status and 2,205 control participants from two separate cohorts in Europe.
Transcriptome-wide association analysis
We performed TWAS analyses of severe COVID-19 using S-PrediXcan [18] to prioritize GWAS findings and identify eQTL-linked genes. S-PrediXcan is a systematic approach that integrates GWAS summary statistics with publicly available eQTL data to translate the evidence of association with a phenotype from the SNP level to the gene level. Briefly, prediction models were built with a flexible and generic approach, multivariate adaptive shrinkage in R (MASHR), using variants with a high probability of being causal for QTL and tissue expression profiles from GTEx version 8 [7, 19]. We chose three tissues that were relevant to SARS-CoV-2 infection: lung, whole blood, and spleen. Then, we ran the S-PrediXcan scripts (downloaded from https://github.com/hakyimlab/MetaXcan, accessed on 10/10/2020) with each of the three tissue-specific models on each of the two severe COVID-19 GWAS datasets. The TWAS significance threshold was adjusted by Bonferroni multiple-test correction for ~10,000 genes. We defined strict significance as p < 5 × 10⁻⁶ (|z| > 4.56) and suggestive significance as p < 5 × 10⁻⁵ (|z| > 4.06).
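The two thresholds above can be reproduced with a short calculation. The sketch below is illustrative only and is not part of the S-PrediXcan workflow: it assumes a family-wise error rate of 0.05 (not stated explicitly in the text), exactly 10,000 tests, and two-sided p-values, and the function names are hypothetical.

```python
from statistics import NormalDist

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold under Bonferroni correction."""
    return alpha / n_tests

def two_sided_z(p: float) -> float:
    """Absolute z-score whose two-sided normal p-value equals p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)

# Assumed inputs: alpha = 0.05 and 10,000 genes tested.
strict_p = bonferroni_threshold(0.05, 10_000)   # about 5e-6
suggestive_p = 10 * strict_p                    # about 5e-5, one order looser

strict_z = two_sided_z(strict_p)
suggestive_z = two_sided_z(suggestive_p)

print(f"strict:     p < {strict_p:.1e}, |z| > {strict_z:.2f}")
print(f"suggestive: p < {suggestive_p:.1e}, |z| > {suggestive_z:.2f}")
```

Under these assumptions the computed cutoffs agree with the reported |z| values of 4.56 and 4.06 to within rounding.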
Colocalization analysis
Colocalization was performed to validate significant TWAS associations using two recent statistical approaches, eCAVIAR [20] and fastENLOC [21], which aim to identify a single genetic variant that has shared causality between expression and the GWAS trait. Both eCAVIAR and fastENLOC assess the colocalization posterior probability (CLPP) for two traits at a locus; eCAVIAR allows for multiple causal variants, while fastENLOC accounts for allelic heterogeneity in expression traits and offers high sensitivity. We ran eCAVIAR between significant TWAS genes and the GWAS trait with a maximum of five causal variants per locus and defined a locus as 50 SNPs up- and down-stream of the tested causal variant, following the recommendation in the original paper. eCAVIAR was downloaded from https://github.com/fhormoz/caviar/ (accessed on 10/25/2020). The biallelic variants from the 1000 Genomes Project phase III in European ancestry were used as an LD reference [22]. We defined CLPP > 0.5 as having strong colocalization evidence.
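As an illustration of the quantity being thresholded, the sketch below computes a per-variant CLPP as the product of a variant's posterior probability of being causal in the GWAS and in the eQTL study, which is how eCAVIAR defines it. The posterior values are made-up toy numbers, the function names are hypothetical, and summarizing a locus by its maximum CLPP is an assumption made here for the example; the sketch does not reproduce eCAVIAR's fine-mapping step.

```python
def clpp(gwas_posteriors, eqtl_posteriors):
    """Per-variant CLPP: product of the two causal posteriors."""
    if len(gwas_posteriors) != len(eqtl_posteriors):
        raise ValueError("both studies must cover the same variants")
    return [g * e for g, e in zip(gwas_posteriors, eqtl_posteriors)]

def strong_colocalization(clpp_values, cutoff=0.5):
    """True if any variant at the locus exceeds the CLPP cutoff (assumed rule)."""
    return max(clpp_values) > cutoff

# Toy posteriors for three variants at one locus (hypothetical values).
gwas = [0.02, 0.90, 0.05]
eqtl = [0.01, 0.88, 0.10]

values = clpp(gwas, eqtl)
print([round(v, 4) for v in values])
print("strong colocalization:", strong_colocalization(values))
```

Because the CLPP is a product of two probabilities, it exceeds 0.5 only when the same variant is confidently fine-mapped in both studies.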
To run fastENLOC, we first prepared probabilistic eQTL annotations to generate the cis-eQTL’s posterior inclusion probability (PIP). Specifically, we applied the tissue-specific data from GTEx and T follicular helper cell-specific data from the DICE database [9] using the integrative genetic association analysis with the deterministic approximation of posteriors (DAP-G) package [23]. Then, GWAS summary statistics were split into approximately LD-independent regions defined by a reference panel of European ancestry, and z-scores were converted to PIPs. We downloaded fastENLOC from https://github.com/xqwen/fastenloc (accessed on 10/25/2020) and followed the guideline to yield the regional colocalization probability (RCP) for each independent GWAS locus using each tissue- or cell type-specific eQTL annotation. We defined RCP > 0.5 as having strong colocalization evidence.
Functional genomics annotations
To better understand the potential function of the variants identified by GWAS analyses and how they mediate the regulatory effect, we annotated significant SNPs using publicly available data. We obtained the tissue- and cellular-level eQTL data from the following resources: 1) the eQTLGen consortium [24] eQTLs generated from 30,912 whole blood samples; 2) Biobank-based Integrative Omics Studies (BIOS) eQTLs generated from 2,116 healthy adults [25]; 3) the GTEx v8 [7] eQTLs of the lung, whole blood, and spleen tissues; 4) the DICE database [9] with cellular eQTLs of 9 available T cell subpopulations. To identify the genomic annotation of the significant SNPs, we downloaded the multivariate hidden Markov model (ChromHMM) [26] processed chromatin-state data of 17 lung and T cell lines from the Roadmap Epigenomics project [27]. To explore the potential chromatin looping of the GWAS locus, we used publicly available chromatin interaction (Hi-C) data [28] at a resolution of 40 kb on IMR90, a normal lung fibroblast cell line. The Hi-C data have been used to identify specific baits and targets from distant chromatin regions that frequently interact with each other. Variants within the regulatory regions can be connected to their potential gene targets and thus mediate gene expression. Statistical tests of bait-target pairs were conducted to define significant bait interaction regions and their targets. The eQTL associations, chromatin-state information, and Hi-C interactions were processed and plotted using the R Bioconductor package Gviz in R version 4.0.3 [29].
Differentially expressed gene analysis in resident memory CD8+ T cells
We used the recently published scRNA-seq dataset of bronchoalveolar lavage fluid (BALF) samples from nine patients (three moderate and six severe) with COVID-19 [17, 30]. We adapted the original annotation [17] and followed their method to calculate the resident memory CD8+ T (TRM) cell signature score by using 31 markers (14 positive markers and 17 negative markers) for all annotated CD8+ T cells [31, 32]. We excluded cells with CD4 expression and defined the top 50% scored cells as the TRM cells. Lastly, we conducted a non-parametric Wilcoxon Rank Sum test using the “FindAllMarkers” function from the R package Seurat (version 3.1.5 in R version 3.5.2) to perform the differentially expressed gene (DEG) analysis between moderate and severe patients.
Cell trajectory and transcriptional program analysis in TRM cells
We used the R package Slingshot [33] to infer cell transition and pseudotime from the scRNA-seq data. Specifically, we first used the expression data to generate the minimum spanning tree of cells in a reduced-dimensionality space [t-Distributed Stochastic Neighbor Embedding (tSNE) projection from the top 30 principal components of the top 3,000 variable genes], assuming there are two major clusters (moderate and severe TRM cells). We then applied the principal curve algorithm [34] to infer a one-dimensional variable (pseudotime) representing each cell’s trajectory along the transcriptional progression. We used our in-house machine learning tool, DrivAER (Driving transcriptional programs based on AutoEncoder derived relevance scores) [35], to identify potential transcriptional programs (e.g., gene sets of pathways or transcription factors (TFs)) that potentially regulate the inferred cell trajectory between the moderate and severe patients. To avoid potential noise from lowly expressed genes, we excluded genes expressed in < 10% of cells. DrivAER took the gene expression and the pseudotime inferred from the previous cell trajectory results (Slingshot) and calculated each gene set’s relevance score by learning a cellular manifold with the Deep Count Autoencoder [36] and fitting a random forest model, whose out-of-bag score served as the relevance score. The transcriptional program annotations were from the hallmark pathway gene sets from MSigDB [37] and transcription factor (TF) target gene sets from TRRUST [38]. To calculate the relevance score, we used the “calc_relevance” function with the following parameters: min_targets = 10, ae_type = “nb-conddisp”, epoch = 100, early_stop = 3, and hidden_size = “(8,2,8)”. The relevance score (R² coefficient of determination) indicates the proportion of variance in the pseudotime explained by the target genes of a transcription factor or the genes in a hallmark pathway.
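To make the relevance score concrete, the sketch below computes an R² coefficient of determination between observed pseudotime values and predicted ones. It is not DrivAER code: in DrivAER the predictions would come from the random forest's out-of-bag estimates, whereas here both lists are hypothetical toy numbers and the function name is an assumption.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total

# Toy pseudotime for four cells and hypothetical out-of-bag predictions.
pseudotime = [0.0, 1.0, 2.0, 3.0]
oob_prediction = [0.5, 1.0, 2.0, 2.5]

relevance = r_squared(pseudotime, oob_prediction)
print(f"relevance score (R^2) = {relevance:.2f}")
```

A value near 1 means the gene set's low-dimensional representation explains most of the pseudotime variance; values near 0 (or below) mean it explains little.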
DNA motif recognition analysis of genome-wide significant SNPs
We used the function “variation-scan” of the online tool RSAT (http://rsat.sb-roscoff.fr/index.php, accessed on 01/15/2020) [39] to predict the binding effect of all the significant SNPs at the 3p21.31 locus. We defined TFs with Bonferroni-corrected p < 0.05 as significant TFs. We then compared them with the TFs that had high relevance scores in the DrivAER analysis above. The position weight matrices (PWMs) for all the TFs were downloaded from the cis-BP Database (http://cisbp.ccbr.utoronto.ca/, version 2019-06_v2.00) [40], and sequence logos representing motif binding sites were generated using the R package seqLogo version 1.54.3 in R version 3.5.2.
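The idea behind predicting an allele-specific binding effect can be sketched with a generic log-odds comparison: score the binding site under a position weight matrix once with the reference allele and once with the alternative allele, and compare. This is not RSAT's variation-scan procedure, which uses its own scoring and p-value calculation; the matrix, background frequency, sequences, and function name below are all hypothetical.

```python
import math

# Hypothetical 4-position probability matrix; not a real TF motif.
TOY_PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]
BACKGROUND = 0.25  # assumed uniform base frequency

def log_odds_score(site, pwm=TOY_PWM, background=BACKGROUND):
    """Sum of per-position log2(motif probability / background probability)."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the matrix width")
    return sum(math.log2(col[base] / background) for base, col in zip(site, pwm))

# Reference site and the same site carrying a T>C change at the last position.
ref_score = log_odds_score("AGCT")
alt_score = log_odds_score("AGCC")
delta = alt_score - ref_score

print(f"ref = {ref_score:.2f}, alt = {alt_score:.2f}, change = {delta:.2f}")
```

A negative change indicates that the alternative allele matches the motif less well than the reference allele under this toy model.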
Results
TWAS analysis identified and replicated two chemokine receptor genes
We utilized the latest S-PrediXcan MASHR models trained with GTEx v8 data for TWAS analyses in lung and whole blood on two GWAS datasets of susceptibility to severe COVID-19 [19]. In the GWASHGI cohort, we found that decreased expression of CXCR6, which encodes C-X-C chemokine receptor type 6, in the lung was associated with an increased risk of developing severe COVID-19 symptoms (p = 1.57 × 10⁻¹⁷, z = −8.53), and this result was replicated in the GWASSCGG cohort at suggestive significance (p = 2.84 × 10⁻⁵, z = −4.19) (Figure 2 and Table 1). Likewise, increased expression of CCR9, which encodes C-C chemokine receptor type 9, in whole blood was associated with an increased risk of developing severe COVID-19 complications in the GWASHGI cohort (p = 7.90 × 10⁻¹¹, z = 6.50), and this result was replicated in the GWASSCGG cohort (p = 3.78 × 10⁻¹⁰, z = 6.26) (Figure 2 and Table 1). The whole blood and lung transcriptome models also identified two additional significant TWAS genes, each specific to one of the two cohorts. Increased expression of the ABO gene in the lung was associated with an increased risk of developing severe COVID-19 symptoms in the GWASSCGG dataset (p = 5.98 × 10⁻⁷, z = 4.99). Similarly, increased expression of the GAS7 gene (Growth Arrest-Specific 7) in whole blood was associated with an increased risk of developing COVID-19 symptoms in the GWASHGI dataset (p = 8.46 × 10⁻⁷, z = 4.92). Overall, the two chemokine receptor genes, CXCR6 and CCR9, were identified and replicated as associated with severe COVID-19, and we used them for downstream analyses.
Colocalization analysis validated the mediation effect of CXCR6 between GWAS locus and severe COVID-19
The TWAS findings might be driven by pleiotropy or by linkage arising from the LD structure in the GWAS loci instead of a true mediation effect [41] (Figure 3A). To rule out the linkage effect and find further evidence of true colocalization of causal signals in the variants that were significant in both GWAS and eQTL analyses, we performed colocalization analysis with eCAVIAR and fastENLOC using several tissue-specific eQTL datasets. eCAVIAR with the eQTL data in lung tissue revealed that the severe COVID-19 association could be mediated by the variants that were associated with the expression of CXCR6 (CLPP = 0.79) (Table 1). The colocalized SNP rs34068335 (GWASHGI p = 5.02 × 10⁻²²) is also associated with an increased monocyte percentage of white cells in a blood-trait GWAS, according to PhenoScanner [42-44]. The fastENLOC analysis showed a high RCP between the expression of CXCR6 in T follicular helper cells and the GWAS signal in both the GWASHGI cohort (RCP = 0.99) and the GWASSCGG cohort (RCP = 0.99) (Table 1). However, colocalization analysis of CCR9 did not suggest strong colocalization evidence (CLPP < 0.1 and RCP < 0.1).
Multi-level functional annotations linked 3p21.31 locus with CXCR6 and CCR9 functions
To explore the potential functions linked with the GWAS risk variants, we examined the functional genomic annotations in this locus. Specifically, we found a consistent decreasing effect on CXCR6 expression in T cells and whole blood from the two large-scale eQTL datasets (Figure 3B). Furthermore, multiple SNPs at the 3p21.31 locus reside in annotated regulatory elements across blood, T cell, and lung cell lines (Figure 3C, Materials and methods). The Hi-C cell line data from lung fibroblasts [28] also showed significant interactions between the 3p21.31 locus and both the CXCR6 and CCR9 promoter regions (Figure 3D). Overall, these multiple lines of evidence all supported the potential regulatory effects of the 3p21.31 locus on CXCR6 expression.
CXCR6 differentially expressed in TRM cells of severe and moderate patients
According to our tissue and cell type expression database (CSEA-DB), CXCR6 is mainly expressed in immune cells in human lung tissue (e.g., T cells and NK cells) [16]. Liao et al. reported that CXCR6 had lower expression in severe patients than in moderate patients, indicating a potential protective effect in T cells of the human respiratory system [17]. However, T cells have various resident and circulating subtypes with diverse functions [45]. To understand which subpopulation(s) of T cells might be associated with the severity of COVID-19, we used the BALF single cell RNA-seq data of six severe patients and three moderate patients. The data included 6,491 T cells (4,356 from six severe patients and 2,135 from three moderate patients). We further used a set of 31 TRM cell marker genes to distinguish the TRM cells from conventional CD8+ T cells (Materials and methods). As shown in Figure 4A and 4B, the TRM cells and conventional T cells could be distinguished in both moderate and severe patients with the classic TRM cell markers (CXCR6 [31], CD69 [46], ITGAE (the gene encoding CD103) [46, 47], ZNF683 [47], and XCL1 [45]) and three negative-control markers (SELL (the gene encoding CD62L) [46], KLF2, and S1PR1 [48]) from a previous study [31]. Among the 1,090 lung TRM cells, we found that 675 cells were from moderate patients and only 415 cells were from severe patients. This represented a 3.32-fold decrease relative to the expected number of TRM cells in severe patients. We used the non-parametric Wilcoxon Rank Sum test to identify the DEGs in the TRM cells between severe and moderate patients and found that CXCR6 had significantly lower expression in the severe patients than in the moderate patients (p < 2.5 × 10⁻¹⁶, fold change = 1.57, Figure 4C).
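The 3.32-fold figure is consistent with comparing TRM cell counts against the T-cell totals reported above for each group. The sketch below shows that arithmetic; using the total T-cell counts as denominators is an assumption made here (the text does not spell out the normalization), and the function name is hypothetical.

```python
def trm_fraction(trm_cells: int, t_cells: int) -> float:
    """Fraction of a group's T cells that were classified as TRM cells."""
    return trm_cells / t_cells

# Counts taken from the text above.
moderate_fraction = trm_fraction(675, 2135)   # three moderate patients
severe_fraction = trm_fraction(415, 4356)     # six severe patients

fold_decrease = moderate_fraction / severe_fraction

# Equivalent view: TRM cells expected in severe patients at the moderate rate.
expected_severe_trm = moderate_fraction * 4356

print(f"moderate: {moderate_fraction:.3f}, severe: {severe_fraction:.3f}")
print(f"fold decrease: {fold_decrease:.2f}")
print(f"expected vs observed severe TRM cells: {expected_severe_trm:.0f} vs 415")
```

Under this normalization, about 32% of T cells in moderate patients but under 10% in severe patients were TRM cells, which gives the reported fold decrease.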
Inferring the transcriptional programs that drive the cell status transition
To understand the transition between moderate and severe TRM cells, we constructed the cell trajectory/pseudotime of the TRM cells by using Slingshot (Figure 4D) [33]. Next, we applied our DrivAER approach (Driving transcriptional programs based on AutoEncoder derived Relevance scores) [35] to identify the potential transcriptional programs that were most likely involved in the cell trajectory/pseudotime. Figure 4E shows a scaled heatmap of the relative expression of naïve and effector markers of T cells in the order of pseudotime generated by Slingshot [33, 38]. We found that the severe TRM cells were mainly gathered in the later stage of the pseudotime. The naïve markers (IL7R, BCL2) were more highly expressed in moderate patients than in severe patients (except SELL). In contrast, some effector markers (GZMB, HAVCR2, LAG3, IFNG) had lower expression in moderate patients than in severe patients. Other effector markers (IRF4, PRF1) had higher expression in the middle of the transition than at its start and end. These results indicated that the TRM cells in severe patients were still in a pro-inflammatory status, although the TRM cell status was more heterogeneous in severe patients than in moderate patients (Figure 4A, 4B, and 4E). As shown in Figure 4F and 4G, the top five molecular signatures (relevance score > 0.25) identified by DrivAER included T-cell pro-inflammatory actions (interferon gamma response, allograft rejection [49], interferon alpha response, and complement system) as well as the proliferative mTORC1 signaling pathway [50]. Among the top TFs (relevance score > 0.25) that drove this cell trajectory, the DNA-binding RELA-NFKB1 complex is involved in several biological processes, such as inflammation, immunity, and cell growth initiated by external stimuli. Signal transducer and activator of transcription 1 (STAT1) and its regulator histone deacetylase 1 (HDAC1) could be activated by various ligands, including interferon-alpha and interferon-gamma. In summary, the TF results are consistent with our hallmark pathway findings above (Supplemental Table 1 and Supplemental Table 2).
Several genome-wide significant SNPs might change the TF binding site affinity
To understand the potential TF binding affinity changes caused by genome-wide significant SNPs, we conducted the DNA motif recognition analysis of the seven TFs related to the transcriptional program between moderate and severe TRM cells (relevance score > 0.25, Supplemental Table 2). We identified SNP rs10490770 [T/C, minor allele frequency (MAF) = 0.097, GWASHGI p = 9.53 × 10⁻³⁹] and SNP rs67959919 (G/A, MAF = 0.097, GWASHGI p = 8.83 × 10⁻³⁹), which were predicted to alter the binding affinity of the TFs RELA and SP1, respectively (Supplemental Figure 1A and 1B). Moreover, these two SNPs were in high LD (r² > 0.8) with several significant lead eQTLs (SNPs rs35896106 and rs17713054) of CXCR6 in whole blood (p = 5.03 × 10⁻³⁷) and T follicular helper cells (p = 1.30 × 10⁻⁵) (Figure 3B). In summary, the genome-wide significant SNPs were predicted to change the binding affinity of TFs highly related to the TRM cell status transition (Supplemental Table 3), suggesting their potential regulation of CXCR6 expression.
Discussion
In this work, we developed a multi-level, integrative genetic and functional analysis framework to explore the effect of host genetic factors on the expression change of GWAS-implicated genes for COVID-19 severity. Specifically, we conducted TWAS analysis for two independent COVID-19 GWAS datasets. We identified and replicated two chemokine receptor genes, CXCR6 and CCR9, with a protective effect in the lung and a risk effect in whole blood, respectively. CXCR6 is expressed in T lymphocytes and is an essential gene in CD8+ TRM cells, mediating the homing of TRM cells to the lung along with its ligand CXCL16 [51, 52]. CCR9 was reported to regulate chemotaxis in response to thymus-expressed chemokine in T cells [53]. The colocalization analysis identified that the GWAS signal and the eQTLs of CXCR6 had high colocalization probabilities in the lung, whole blood, and T follicular helper cells, which supports a genetic regulation role at this locus. At the single cell level, our DEG analysis identified that CXCR6 had lower expression in severe COVID-19 patients than in moderate patients in both T cells and TRM cells, supporting its protective effect identified in TWAS analysis in lung and whole blood. The expected proportion of TRM cells also decreased by 3.32-fold (Table 2). Interestingly, these findings were replicated in circulating CXCR6+ CD8+ T cells of severe and control/mild patients by a flow cytometry experiment [54]. We identified that the major forces driving the transition from moderate TRM cells to severe TRM cells are pro-inflammatory pathways and TFs.
From the TWAS and colocalization analysis in lung and immune cells, we successfully replicated that CXCR6 was centered in the GWAS signal at locus 3p21.31. Previous studies have reported that CXCR6-/- significantly decreases airway lung TRM cells in mice due to altered trafficking of CXCR6-/- cells within the lung [52], which could explain the much smaller proportion of TRM cells in severe patients than in moderate patients. The lung TRM cells provide the first line of defense against infection and coordinate the subsequent adaptive response [55]. A previous study reported that TRM cells constitutively express surface receptors (PD-1 and CTLA-4) that are associated with inhibition of T cell function, which might prevent excessive activation or inflammation in the tissue niche [56].
We further used nine classic naïve markers (e.g., BCL2, SELL, TCF7, IL7R) and ten classic effector markers (e.g., GZMB, PRF1, IFNG, LAG3, PDCD1) to quantify the naïve and effector status of the TRM cells (Supplemental Figure 2). TRM cells in severe patients had a much higher median effector marker score than those in moderate patients (0.44 vs. 0.18), suggesting that the severe TRM cells had much higher inflammatory activity, as we demonstrated in Figure 4F, despite their decreased proportion. For the naïve score (Supplemental Figure 2), both moderate and severe TRM cells had limited expression (median score: 0.028 in severe and 0.038 in moderate TRM cells). Interestingly, when we removed the lymph node homing receptor SELL [31] from the naïve marker list, the median naïve marker score in severe patients dropped to 0 (Supplemental Figure 2). This indicated that SELL expression contributed greatly to the naïve status of TRM cells in severe patients. Consistently, in Figure 4E, we could also observe that a larger proportion of TRM cells had higher SELL expression in severe patients than in moderate patients, suggesting that the TRM cells in severe patients might not be in a stable cell status due to the lymph node homing signal. Accordingly, we hypothesized that genetically determined lower expression of CXCR6 would decrease the proportion of TRM cells residing in the lung through the CXCR6/CXCL16 axis [54], impairing the first-line defense. Moreover, the lower expression of CXCR6 would also lead to the ‘unstable’ residency of TRM cells in the lung (Figure 4B). The TRM cells play essential roles in orchestrating the immune system, and their absence would lead to severe COVID-19 symptoms, such as acute respiratory distress syndrome, cytokine storm, and major multi-organ damage [57] (Figure 5).
There are several limitations in this study. The GWASHGI dataset used in this study was HGI round 4 (alpha), which was the largest GWAS as of the access date of October 20, 2020. However, it was no longer the largest GWAS meta-analysis for severe COVID-19 when we prepared the manuscript. This research field is evolving very fast because of the urgent demands of public health. Currently, the largest GWAS, HGI round 4 (freeze), contains more samples (4,336 cases/353,891 controls), and it included two independent datasets we used in this study. Considering that the GWASHGI dataset included ~10% control samples from the Asian population, we checked the LocusZoom plot of the chr3: 45.80-46.40 million base pairs (Mb) region on GRCh37. We found a consistent tendency in the GWAS round 4 alpha and freeze versions (Supplemental Figure 3). Another limitation is that the scRNA-seq data only had nine COVID-19 patient samples (six severe and three moderate samples), which might not provide enough statistical power at the sample level, although each scRNA-seq sample is commonly considered to act like a population. Currently, disease-context sample data are still very limited because of the lack of BALF samples from COVID-19 patients. It is still unexplained how the colocalized SNP rs34068335 of CXCR6 might be related to the increased monocyte percentage of white cells and whether it could be related to the severity of COVID-19 [42]. Finally, the TF binding site affinity alterations were assessed based on computational prediction; therefore, the in vivo effects require experimental validation. We anticipate that more and larger datasets will be released in the near future, and we will apply our integrative analysis approach to such new data.
Conclusions
Our work systematically explored the genetic effect on gene expression at chromosome locus 3p21.31 and pinpointed CXCR6 as a gene that might be involved in the severity of COVID-19.
Several genome-wide significant SNPs were within the LD block of CXCR6 eQTLs in immune-related cells. In a scRNA-seq COVID-19 BALF dataset, we characterized that CXCR6 (TRM cells marker gene) had a lower expression in severe patients than in moderate patients. Moreover, the TRM cells in severe patients had a 3.32-fold proportion decrease and much higher pro-inflammatory activity than TRM cells in moderate patients. Based on these observations, we proposed a potential mechanism on how the lower expression of CXCR6 regulated by the endogenous factors could progress to severe COVID-19 outcomes.
Supplemental Figure 1. Sequence logos representing DNA binding sites generated from the position weight matrix (PWM) for transcription factors RELA (A) and SP1 (B). SNPs rs10490770 (T/C) and rs67959919 (G/A) are predicted to have the strongest impact on their sequences (GTGGATTTTCA - Reverse strand, p = 9.8 × 10⁻⁴ and TACCCGCCGG - Reverse strand, p = 9.3 × 10⁻⁴) by the RSAT online tool (Supplemental Table 1). The polymorphic site within each transcription factor binding site is highlighted in the red box.
Supplemental Figure 2. Violin plots showing the distribution of key features between moderate and severe patients. We calculated the proportion of the effector and naïve T cell markers in each cell. The naïve markers include BCL2, SELL, KLF2, CCR7, TCF7, LEF1, ID3, BACH2, and IL7R. The effector markers include GZMB, PRF1, IRF4, IFNG, TNFRSF9, PDCD1, LAG3, HAVCR2, TOX, and NR4A2. Median scores of each category are listed on the top of each violin plot. The “*” on the third column denotes the naïve markers without the SELL gene. The “TRM sig score” was calculated from the 31 TRM signature genes described in Materials and methods. M and S represent moderate and severe patients, respectively.
Supplemental Figure 3. LocusZoom views of Host Genetics Initiative GWAS round 4 (alpha) (A) and Host Genetics Initiative GWAS round 4 (freeze) (B) for the 3p21.31 locus on GRCh37. The x-axis represents the chromosome position in million base pairs (Mb) and the y-axis represents the -log10 (p-value) from the two GWASHGI datasets. The color indicates the strength of linkage disequilibrium with the lead SNP rs35081325.
Acknowledgements
We appreciate Drs. Teng Liu and Dawei Zou for the valuable comments. We thank all members of the Bioinformatics and Systems Medicine Laboratory for the discussion. Dr. Zhao was partially supported by National Institutes of Health grant R01LM012806 and Chair Professorship for Precision Health funds. We thank the technical support from the Cancer Genomics Core funded by the Cancer Prevention and Research Institute of Texas (CPRIT RP180734). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.