STAT3-mediated allelic imbalance of novel genetic variant rs1047643 and B cell specific super-enhancer in association with systemic lupus erythematosus

Mapping of allelic imbalance (AI) at heterozygous loci has the potential to establish links between genetic risk for disease and biological function. Leveraging multi-omics data for AI analysis and functional annotation, we discovered a novel functional risk variant rs1047643 at 8p23 in association with SLE. This variant displays dynamic AI of chromatin accessibility and allelic expression on FDFT1 gene in B cells with SLE. We further found a B-cell restricted super-enhancer (SE) that physically contacts with this SNP-residing locus, an interaction that also appears specifically in B cells. Quantitative analysis of open chromatin and DNA methylation profiles further demonstrated that the SE exhibits aberrant activity in B cell development with SLE. Functional studies identified that STAT3, a master factor associated with autoimmune diseases, directly regulates both the AI of risk variant and the activity of SE in cultured B cells. Our study reveals that STAT3-mediated SE activity and cis-regulatory effects of SNP rs1047643 at 8p23 locus are associated with B cell deregulation in SLE.


Introduction 26
Super-enhancers (SEs) are recently discovered large domains of clustered enhancers 27 (1, 2). The extraordinary feature of SEs is the extremely high and broad enrichment of Signal transducer and activator of transcription 3 (STAT3), as one of seven STAT 35 family members, is activated by phosphorylation at tyrosine 705 (Y705) and/or at 36 serine 727 (S727) (5). After import to the nucleus, the phospho-STAT3 (pSTAT3) 37 modulates gene transcription by binding its target sequence (6). STAT3 has gained 38 broad attention because it plays a key role in a variety of pathophysiological immune 39 responses related to lymphocyte development and differentiation, and in other cellular 7 confidence/quality by depth < 2, (4) strand bias score > 50, (5) genotype score < 15 114 and (6) read depth < 8. Then, we extracted SNPs annotated from dbSNP (Build 150) 115 that were called as heterozygotes for each sample. For a reasonable comparison, those 116 heterozygous SNPs identified at least triple in both case and control samples were 117 retained. Using allelic ratio (20) as a response variable in linear regression model (see 118 below), we conducted AI analysis on chromatin accessibility for each heterozygous 119 SNP in a comparison between SLE and controls. 120 Allelic ratio ~ α + β * disease + ε 121 The p-values and beta coefficients were calculated to estimate the significance of the 122 association, and the differences between cases and controls, respectively.

124
Association analysis 125 For genotype data from a SLE case-control study in Hispanic population, all typed 126 SNPs in chromosome 8 were extracted for imputation using TOPMed Imputation 127 8 Super-enhancer annotation 135 We downloaded whole-genome chromatin state segmentation data (core 15-state  Analysis of eQTL data 142 We collected eQTL data sets from three large-scale studies, the Genotype-Tissue  For other genome-wide Hi-C (Accession ID: GSE113405) and capture Hi-C (CHi-C) 153 datasets (Accession ID: GSE81503 and E-MTAB-6621), we used the Hi-C Pipeline 154 (HiCUP) (33) to truncate and align reads to the human reference genome. The 155 deduplicated data were then processed using the Homer pipeline (34) to call the 9 significant chromatin interaction at 10 kb resolution. The resulting interactions were 157 visualized using UCSC Genome Browser or Sushi package in R environment.  198 The PCR primers are shown in Table S3. Data were presented as mean ± SD of three replicates unless stated otherwise.

202
Correlation analysis was performed using Pearson's correlation coefficient. The 203 differences were considered statistically significant at two-sided P-values less than  (Table S1). Of eleven studies, seven are SLE case-control 211 studies with data across three immune cell types including B cells, T cells and 212 Neutrophils (Table S2). Also included in the present study were SNP microarray data 213 from a SLE GWAS study (n = 2,279). We next designed a two-stage study ( Figure 1) to identify putative SLE-associated 218 functional variants. In stage I, also termed as the discovery stage, two chromatin 219 accessibility (ATAC-seq) data sets (Accession ID: GSE118253 and GSE71338, Table   220 S1) comprised 49 samples were analyzed. We focused on those variants displaying 221 difference in AI of chromatin accessibility at heterozygous SNP sites in a comparison 222 between SLE and controls (see Methods in detail). From the reciprocal validation 223 between two data sets, SNP rs1047643 was identified to show the significant AI in B 224 cells from patients with SLE, relative to controls ( Figure 2A). Interestingly, in B cells 225 at different stages, the allelic preference of chromatin accessibility for this 226 SLE-associated SNP is alterable. For example, the T allele exhibits more preferential 227 chromatin accessibility in activated B cells from patients, relative to the C allele.

228
However, the direction is reversed in SLE naive B cells.  Analyzing RNA-seq data (Accession ID: GSE118254), we determined the AI of RNA 233 transcripts for the rs1047643. In line with results shown above, we observed the 234 dynamic AI pattern on the transcriptional level for the rs1047643 ( Figure 2B).

235
Meanwhile, this dynamic allelic expression pattern is specific during B cell 236 development with SLE ( Figure 2C).  Figure 3E, where one SNP rs2736336 is excluded due to its multivariate 246 alleles). Of the 12 index SNPs, indeed, one index SNP rs17807624 with the statistical 247 significance with P < 1.5 × 10 -3 using the univariate analysis, is the top signal to 248 which the SNP rs1047643 is conditional. Thus, we performed haplotype analyses on 249 these two SNPs (index SNP rs17807624 and rs1047643, Figure 3B). Compared with 250 the reference haplotype, which carries the alleles associated with a reduced risk in two 251 SNPs, haplotype 2, which carries the risk-associated alleles, showed a significant 252 association (adjusted P = 0.03).  Figure 3D). Interestingly, besides correlated with three 258 adjacent genes (FDFT1, CTSB and NEIL2), the rs1047643 is also an eQTL linked 259 with an upstream BLK gene in a distance of ~300 kb, a result that is detected in two 260 independent data sets. An analysis of RNA-seq data from two independent studies 261 (Accession ID: GSE118254 and GSE92387, Table S1) consistently showed that 262 expression patterns for two representative genes (BLK and FDFT1) are gradually 263 increased in a developmental process from naive to memory B cells, in particular, the 264 double negative memory B cell subset in patients with SLE, the pattern that is not 265 observed in controls ( Figure S1 and S2). By searching for enhancers and other regulatory elements across 8p23 locus from a 268 dataset of the 127 epigenomes from Roadmap, we identified a SE with a length of 7kb 269 in the upstream of BLK gene in CD19+ B cells (Epigenome ID: E032, Figure 3E). An  Figure 3C).

275
Analyzing Hi-C data sets from two independent studies in GM12878 cells, we 276 observed a DNA looping between the SNP rs1047643 within FDFT1 and the SE 277 region ( Figure 3F). More importantly, in GM12878 B-lymphoblastic cells, this SE 278 region has a wealth of long-range interactions with adjacent genes (e.g., BLK) and  in non-active B cells from SLE patients, relative to healthy controls ( Figure S5).

306
We further conducted quantitative analyses on ATAC-seq data from another two 307 independent studies in two immune cell types, T cells and neutrophils (Accession ID: 308 GSE139359 and GSE110017, Table S1). The results showed that there was no 309 marked enrichment of ATAC-seq reads on both the SE and FDFT1 promoter regions 310 in these two immune cell types for both SLE and controls ( Figure S6). Collectively, 311 these results suggest a B cell specific, rs1047643-interacting SE whose activity is 312 aberrant in SLE B cell development.   We next tested whether the STAT3 might also regulate the rs1047643-residing 347 regions. Using allelic qPCR assay, we confirmed that genomic DNA in the GM11997 rs1047643. The association study shows that the rs1049643-T is a risk allele for SLE.

375
Our AI analyses indicate that the rs1049643-T allele resides in more open chromatin 376 state and has higher expression in SLE memory B cell subsets, relative to the C allele.

377
Functional study further provides evidence that the rs1049643-T allele is 378 preferentially bound by STAT3. The SNP rs1047643 is also an eQTL linked with 379 both proximal and distal genes, including BLK, the gene that plays a critical role in B 380 lymphocyte development (37). These results demonstrate that this novel 381 SLE-associated risk rs1047643 whose functionality is mediated by STAT3, may play 382 a role in allele-specific control of adjacent genes at 8p23 locus in B cells. Despite no 383 report for association with other autoimmune diseases, this SNP has been associated 384 with multiple myeloma (38) and follicular lymphoma (39), two malignant diseases 385 whose pathogenesis is partially associated with the dysfunction of B cells.

386
Specifically, hyperactive STAT3 has been reported to be associated with poor 387 survival in both diseases (40,41). Therefore, our findings may provide a clue for 388 genetic and mechanical studies on those B cell associated diseases.     443 The authors declare no competing financial interests.           Next-generation genotype imputation service and methods. Nat Genet.     Table S1. Summary of data sets used in the study. Functional genomics data sets, 689 including ATAC-seq, RNA-seq and RRBS-seq data sets from seven SLE case-control 690 studies (Table S2), and Hi-C data sets in multiple cell lines, and a SNP microarray 691 data set from a lupus GWAS study.    arrow) on 8p23 locus from CHi-C data with duplicates in two types of normal T cells.

709
Orange arrow represents the location of super-enhancer identified in this study. This txt file contains source data used for the quantitative analyses shown in Fig. 4.