Widespread methylation quantitative trait loci and their role in schizophrenia risk

DNA methylation (DNAm) regulates gene expression and may represent gene-environment interactions. Using whole genome bisulfite sequencing, we surveyed DNAm in a large sample (n=344) of human brain tissues. We identify widespread genetic influence on local methylation levels throughout the genome, with 76% of SNPs and 38% of CpGs being part of methylation quantitative trait loci (meQTLs). These associations can further be clustered into regions that are differentially methylated by a given SNP, highlighting putative functional regions that explain much of the heritability associated with risk loci. Furthermore, some CpH sites associated with genetic variation. We have established a comprehensive, single base resolution view of association between genetic variation and genomic methylation, and implicate schizophrenia GWAS-associated variants as influencing the epigenetic plasticity of the brain. One-sentence summary Most genetic variants associated with DNA methylation levels, and implicated schizophrenia GWAS variants in the human brain.


Introduction
DNA methylation (DNAm) plays an important role in the epigenetic regulation of gene expression. It varies throughout development and among tissue types, and has been thought to be a high-fidelity representation of the interaction between genes and environment. While some variation in DNAm can be attributed to developmental and exogenous factors, including diet ( 1 ) and cigarette smoking ( 2 ) , Davies et al. ( 3 ) identified some inter-individual variation that is consistent across tissue types. This provided evidence that DNA sequence drives DNA methylation levels, at sites known as methylation quantitative trait loci (meQTLs). Inter-individual DNAm differences have since been confirmed by twin studies ( 4 , 5 ) . Initial studies found methylation association with sequence variants at specific loci ( 6 ) . These putative epigenetic associations extend beyond the role of genetic variation in ablating CpG dinucleotides across evolution through deamination of the cytosine base in this genomic context ( 7 ) .
Genome-wide studies are necessary to fully understand the extent and genomic properties of meQTLs. However, so far most large-scale studies have used microarray technologies that only measure a small proportion of CpGs ( 8 -10 ) . The largest study to date to test associations between genotype and DNAm used MBD-seq, a method lacking single base pair resolution ( 11 ) . Yet even with limited resolution, these initial studies have found that local genetic influence on DNAm is extensive throughout the genome, and meQTLs are enriched at regulatory sites ( 12 , 13 ) .
Currently, a major puzzle in the field of functional genomics is understanding the molecular effects of genetic risk loci and variants identified by genome-wide association studies (GWAS) for many common disorders and traits. This is particularly challenging in tissues like brain that are difficult to access or model, leaving little clarity into genetic mechanisms behind psychiatric disorders such as schizophrenia. While schizophrenia is highly heritable ( 14 ) , and GWAS have identified a growing number of significant loci ( 15 , 16 ) , only few loci have been functionally resolved ( 17 ) . Genome-wide gene expression QTL (eQTL) approaches ( 18 , 19 ) , and their extensions ( 20 -22 ) , have prioritized variants and associated genes, but many genomic loci fail to associate with nearby gene expression. In contrast, associating schizophrenia risk variants with a stable epigenetic mark like DNAm provide clues for potential epigenetic mechanisms of action ( 23 , 24 ) . Indeed, previous meQTL maps using microarray technologies implicated a larger number of SCZD risk loci than eQTL maps, while only measuring a fraction of the methylome ( 10 ) . DNAm itself may further reflect the cumulative effects of environmental exposures across the lifespan ( 25 ) , and may represent a surrogate of "E" in GxE interactions that contribute to risks for many disorders ( 26 ) that can further act as a mediator of genetic risk on gene expression.
Unlike microarray technologies, whole genome bisulfite sequencing (WGBS) has the advantage of measuring cytosine methylation at single base pair resolution, as well as measuring CpH (H = A,T, or C) DNA methylation levels (in addition to CpGs). While CpH sites are generally unmethylated in somatic tissues, neurons in the human brain have uniquely high levels of CpHm ( 27 ) . By leveraging this technology, we have created the most extensive genomic meQTL map in human postmortem brain tissue to date, and use this information to fine-tune our understanding of the molecular mechanisms of genetic and epigenetic risk for schizophrenia .

Components of global variation in large-scale WGBS datasets
We performed whole genome bisulfite sequencing (WGBS) to gain a comprehensive view of genetic influence on DNAm in the adult human brain using two brain regions: the hippocampus and the dorsolateral prefrontal cortex (DLPFC). These regions have been prominently implicated in the pathogenesis of many psychiatric disorders, particularly schizophrenia ( 28 ) .
After data processing and quality control (see Methods), we analyzed 165 DLPFC samples and 179 hippocampal samples from a total of 183 adult donors aged 18 to 96 years (161 donors had data from both regions, Table S1 ). Data were generated across two large diagnosis-and region-balanced batches. We assessed 29,401,795 CpG sites across the epigenome, with an average post-processing coverage of 17.3 reads per CpG site. 78% of sites were highly (>80%) methylated while a minority, were lowly or unmethylated (8% are <20%). While the technical effects of measuring DNAm levels using microarrays is well-established ( 29 ) , particularly in human brain tissue ( 10 ) , corresponding assessments using WGBS data have been limited due to available comparisons being relatively small studies. We therefore assessed the contributions of different biological and technical variables on genome-wide CpG DNAm levels measured with WGBS.
The major batch effects resulted from inclusion of ENCODE "blacklist" regions ( 30 ) , which have been reported to cause problems with mapping and alignment in high-throughput sequencing data, particularly epigenomic data. These processing issues are likely further exacerbated in WGBS data, where the bisulfite treatment results in lower complexity libraries depleted of cytosines, which presumably relates to the influence of blacklist regions and ancestry on DNAm levels. Cytosines in these black-listed regions were therefore removed from reported site-specific analysis results. Another increasingly-common step in WGBS data processing involves "smoothing" local CpG DNAm levels within each sample to improve precision and borrow strength across nearby CpGs ( 31 ) . Smoothing reordered the top components of variation ( Figure S2 ), and resulted in the top component of variation representing both batch and estimated neuronal fraction (both PC1 and PC2, explaining 13.9% and 10% of variance, respectively). Previous analyses of Illumina microarray-derived adult homogenate DLPFC data suggested that estimated neuronal fraction was the largest component of (CpG) DNAm level variation ( 10 ) . Microarray technology implicitly produces somewhat smooth DNAm levels for a large fraction of probes that target multiple CpG sites. While the effects of brain region were further magnified with smoothing, the effects of quantitative ancestry were dampened, and became associated with PC4 (1% variance explained) rather than PC1 of raw DNAm levels (6% explained variance; Figure 1(A-E) ).
We further complemented these global analyses using site-specific variance components analysis, estimating the percentage of smoothed DNAm level variance explained by technical and biological components at each autosomal CpG, excluding the blacklist (N = 26,416,185, using ANOVA, see Methods, Figure 1F , Data S1 ). The factors that explained the largest components of site-specific variation were technical batch (median: 9.6% variance explained, interquartile range 3.4-19%), and related flow cell (7.9%, 6.3-9.6%) and instrument (2.9%, 2.1-3.2%) variables, as well as the more biological brain region (1.5%, 0.3-5.9%), estimated neuronal fraction (1%, 0.2-3.8%) variables. Traditionally-considered confounders in postmortem human brain studies -including tissue pH and postmortem interval (PMI) -had very little influence on site-specific DNAm levels using WGBS (in line with previous microarray-based analyses ( 10 ) ). For example, pH and PMI explained more than 1% of variance across only approximately 5% of measured sites. Other technical variables hypothesized to influence DNAm levels like the sequencing alignment rates and bisulfite conversion rates (estimated with spike-in sequences, see Methods) showed little influence in this analysis. Overall, there was extensive residual variation of DNAm levels for the majority of sites beyond these technical and biological variables.
Local genetic variation has strong effects on CpG DNA methylation levels We hypothesized that a large component of this residual DNAm variation was likely captured by local genetic sequence. We therefore performed genome-wide methylation quantitative trait locus (meQTL) analyses (see Methods) on smoothed DNA methylation levels in each brain region separately (across 29,401,795 CpG sites). In the DLPFC, we computed meQTLs between each of these CpGs and the subset of common SNPs within 20kb upstream and downstream , which identified 341,597,218 significant SNP-CpG pairs (at FDR < 0.01, Data S2 ), representing 5,971,724 (76%) of the tested SNPs and 11,155,961 (38%) of tested CpGs. Given high genomic correlation among both CpGs and SNPs, we performed the same analysis with a set of 535,859 LD-independent SNPs (R 2 < 0.2) to reduce the potential effects of linkage disequilibrium (LD) inflating these statistics. This sensitivity analysis found a similar number of CpGs identified as meQTLs (8,390,092 CpGs, 29%) with similar properties. Most SNPs associated with methylation levels at many nearby CpG sites (mean = 57 CpGs, median = 43 CpGs), and the methylation-associated SNPs had varying genomic widths of effect in this local window ranging from 1bp to the full 40kb (mean = 14.5kb , median = 12.7 kb). Effect sizes were generally small, with a mean of 2.6% change in methylation level per allele (IQR: 1.4%-3.1%), but ranging up to 47%. In both analyses we found that SNPs that disrupt a CpG dinucleotide (i.e. a variant at the C or G, which would ablate the capacity for methylation) have a slightly but significantly lower width of effect and a slightly higher number of correlated CpGs, meaning they have a higher effect density. This may be attributed to the fact that CpGs tend to cluster in the genome. Additionally, despite tested SNPs being LD-independent in the second analysis, we find that half of CpGs associate with more than one SNP, with a mean and median of 2, in line with previous observations in gene expression data ( 18 , 32 ) . Using information from the Roadmap Epigenome ( 33 ) , most genetically-associated CpGs were in quiescent genomic regions, and depleted for enhancer regions, in human brain ( Figure 2A ).
Analyses on LD-independent SNPs yielded a similar proportion of SNPs ( 403,373 SNPs, 77%) implicating a similar number of CpGs as seen in the DLPFC (8,566,898). Effect sizes were similarly small, with a mean of 2.6% change in methylation level per allele. Hippocampal meQTLs have similar width of effect as those in DLPFC with a mean of 15,661 bp and an average of 60 CpGs associated with a SNP. These analyses suggested that the global properties of meQTLs were highly similar across brain regions.
We performed a series of secondary analyses to better characterize the determinants of such extensive genetic regulation of DNA methylation. First, due to the mixed ethnicities of our samples, and the potentially large differences in allele frequencies between ancestry groups ( 34 ) , we ran post-hoc meQTL analysis on a subset of meQTLs identified in full genome analysis in the DLPFC, separating samples into two groups by self-reported race. African Americans (AAs) made up 67% of our total sample, and thus were more likely to drive the results. Using significant meQTLs on chromosome 1, analyses using only African American samples (N=112) showed that 99.97% of the meQTLs were directionally consistent with the full analysis, with 97% marginally significant (p < 0.05) and 83% genome-wide significant (FDR < 0.01) in the smaller sample size and an overall sharing of = 0.997. In the European ancestry samples (N=53), 97% of meQTLs were directionally consistent, with 57% marginally significant and 21% genome-wide significant with an overall sharing of = 0.854. These decreased proportions compared to AA-specific analyses at least partially relate to the smaller sample size and resulting decreased statistical power ( Figure S3 ). We also found that in general, differences in minor allele frequencies across ancestry groups did not associate with differences in meQTL effect magnitude ( Figure S4 ), indicating that differences in ethnicity composition of our samples were likely not driving our combined ancestry analyses above. We further explored the robustness of the selected meQTL window size (20kb) using heritability analysis (see Methods) on the methylome ( 35 )  Schizophrenia risk-associated SNPs on average associated with 172 CpGs (median = 104), and in this cis window had an average genomic width of effect of 177kb (median = 147kb).
We further performed functional validation of these associations using corresponding gene expression data. Using RNA-seq data from the same regions and donors (see Methods), we assessed whether methylation at these CpGs correlated with neighboring expression levels.
Using previous eQTL analyses on these same PGC loci ( 18 , 39 ) , we assessed the mediation of expression by mCpG (see Methods). Eleven of 127 loci had a correlation between gene expression and the methylation with which they are associated. Importantly, 10 of these associated with at least one CpG that mediated expression by at least 25%. The same analyses on the exon and junction levels picked up subtler effects, detecting 18 and 27 loci mediating expression levels via methylation, respectively. We found that overall, methylation mediation was most potent on the exon level (median = 40%), then the junction level (median = 32%), and least potent on the full gene level (median = 23%), in line with the putative role of DNAm in promoting gene splicing ( 40 ) .
The same meQTL analysis was performed in the hippocampus WGBS data, revealing 48,023 significant SNP-CpG pairs ( Table S3 ), representing 139/153 tested SNPs. There were 15,119 trans -meQTLs, many more than in DLPFC. Within the subset of significant DLPFC meQTLs, hippocampal meQTLs had an overall sharing of = 0.97, indicating that our findings are very consistent between brain regions.
These results indicated that meQTL effects, at least in the context of GWAS associations with schizophrenia, have much broader effects than traditionally considered, and much wider than the 20kb window examined at the full genome level. In order to see if schizophrenia-associated meQTLs are comparable to non-disease-associated meQTLs, we took 5000 random SNPs representing all levels of MAF and ran meQTL analysis with a 250kb window. Again we found that the majority of SNPs (93%) are meQTLs. We also find that neither MAF nor population-MAF differences associate with any meQTL characteristics. Interestingly, we found that these random meQTLs had significantly lower width of effect than schizophrenia-associated meQTLs in both regions (DLPFC p = 8.9e-5, Hippocampus p = 1.1e-11), and a significantly fewer number of affected CpGs in hippocampus (p = 0.002). This combined with the chromatin state enrichment analysis below may indicate that these PGC-meQTLs are particularly functional, and potentially involved in disease processes, as opposed to just being standardly representative of the whole genome.

Risk-associated meQTL effects cluster in the genome
We then proceeded to cluster our meQTL CpGs into differentially methylated regions (DMRs) for better functional characterization. Using a CpG-specific t-statistic cutoff of 5 (see Methods), these sites could be clustered into 1277 DMRs ( Figure 3 , Table S4 ). The majority of SCZD index SNPs had such DMRs, and most had more than one (mean = 9.5, median = 6). The overall span of effect for each SNP was much larger than the 20kb cis window we tested above for meQTL analyses across the full genome, ranging up to 240 Mb on a single chromosome, with a median of 95 kb (mean = 17.5 Mb). Using Roadmap Epigenome ( 33 ) data, these SCZD risk-associated DMRs were enriched over the background of genome-wide LD-independent meQTLs for transcriptional and weak transcriptional chromatin signatures ( Figure 2B ). They were also comparatively depleted for weak repressive polycomb and quiescent chromatin signatures. Overall, these DMRs were in or near genes enriched for GO terms related to synapse and membrane potential ( Figure S5 ). 20 DMRs overlapped with psychENCODE enhancers, and 142 overlapped with promoter regions. The genes connected to these promoters were enriched for GO terms related to acetylcholine, ion channels, and neurotransmitters.
Overall, results of clustering meQTL CpGs were quite similar between regions. In the hippocampus there were 1408 DMRs ( Table S5 ) . The were generally larger and more distant from each index SNP than the , with the majority of distal (in trans ). We further divided these DMR sets into those cis ( and ) and trans ( and ) relative to the PGC loci, i.e. those that were within the GWAS LD blocks, and those outside these blocks ( Figure 4 ) SNPs and 63% of these CpH sites were also meQTLs in the DLPFC. Similarly, CpH-meQTLs had much larger effect sizes (mean = 29%) and most represented CpH sites (58%) were inside genes. These effects in each brain region presumably represent neuronal-specific genetic regulation of DNAm levels.
We also performed more focused CpH-meQTL analyses on the PGC SNPs described above and found 1444 significant cis -meQTLs and 48 trans -meQTLs in the DLPFC ( Table S6 ). Again, a majority of PGC SNPs were represented (141/152). Some of these CpH sites were near CpG-meQTLs, but many were not (mean distance = 120kb, median = 2798bp), suggesting potential independent effects of genotype on different sequence contexts of DNAm. Like with CpGs associated with genotype, CpHs meQTLs were also enriched for transcriptional and weak transcriptional chromatin states over full genome CpH-meQTLs, and depleted for repressor polycomb and quiescent states ( Figure 2C) ( 33 ) . Most CpHs were inside genes that were subsequently enriched for neuronal GO terms related to neurons, synapses, and channels, further validating the neuronal contribution of CpH DNAm levels. We similarly observed much larger effect sizes of risk alleles in CpHs compared to CpGs in line with genome-wide analyses above, with a mean of 27% compared to 2% respectively. In hippocampus, we found 1588 cis -CpH-meQTLs and 92 trans -CpH-meQTLs ( Table S7 ), representing 148/153 tested SNPs.
Similar to all previous analyses we see that these sites are mostly inside genes and have much larger effect sizes than CpGs. The genes represented by these CpHs are enriched for GO terms related to neuronal anatomy, synapses and IP3. Again, distance to the nearest CpG-meQTL is highly variable, ranging from 1 to 4,217,747 bp (mean = 24,374, median = 2,761). Results were overall similar between both brain regions, and 1219 CpH-meQTLs were in common between both regions, though again, there were unique associations across regions.

Age associations to DNAm levels
While there was extensive evidence of meQTLs in our WGBS data, there were a subset of CpG sites that showed high percentages of variance explained by age ( Figure 1F ). We therefore more formally modeled methylation over age in both brain regions, as DNA methylation has been shown to globally accumulate with age ( 47 , 48 ) . We found extensive association with age, across approximately 2 million CpG sites in each region (at FDR < 0.05, Data S6 ) . The majority of these sites were age-associated in both regions, with a sizable fraction of sites showing some regional specificity (700,000 sites in the DLPFC and 800,000 sites in hippocampus). The majority (94%) of sites increase in methylation with age, with half of sites in promoter regions, and a quarter in CpG islands or shores. Only 9% of genes represented by these differentially methylated promoters had significant correlation to gene expression levels in these samples (see Methods). In contrast, there was very little CpH association with age, with only 5,136 and 445 significant sites in hippocampus and DLPFC respectively (at FDR < 0.05). These results suggest that CpH methylation may be more stable across adulthood and aging after establishment in postnatal life.
Given the large extents of meQTL-and age-associated sites, we asked whether any CpG sites showed dynamic meQTL effects across the adult lifespan. Despite age being associated with methylation at many sites throughout the genome, we found there were practically no changes in meQTL effects across the adult lifespan (i.e. statistical interaction between age and genotype), and, if anything, sites that were differentially methylated by age were depleted (p < 2.2e-16) for being associated with local genetic variation (i.e. being meQTLs).
Minimal illness-state associated differential methylation levels We lastly modeled methylation differences between patients with schizophrenia and neurotypical controls. These associations are typically more subtle -fewer with smaller effect sizes -than age or genotype effects in microarray data ( 10 ) and more likely to represent cohort-or dataset-specific findings ( 49 ) . In these WGBS data, we found very few FDR-significant CpG sites -none in DLPFC and 70 in hippocampus. This is not surprising based on previous studies and the high multiple testing burden, almost two orders of magnitude more than microarray platforms. We saw similar lack of signal at CpH sites, with no significant hits in DLPFC and 1293 in hippocampus, with most (70%) of the hippocampal hits being in or nearby genes. These results suggest rather subtle effects of schizophrenia diagnosis on the methylome, particularly in the contexts of much stronger genotype-and age-associated effects.

Discussion
Here we present one of the most comprehensive whole genome bisulfite sequencing (WGBS) studies to date, particularly in human brain tissue, to better understand technical and biological factors that contribute to genome-wide DNA methylation levels at both CpG and CpH sites. We first demonstrated, at a single base pair resolution, that meQTLs are highly abundant throughout the entire genome at a breadth and scope previously unseen. Not only can common SNPs associate with CpG methylation, but they also uniquely and independently associate with CpH methylation levels in adult neurons. Furthermore, we demonstrated clinical relevance of these single base resolution meQTL maps to identify the functional significance of loci identified by GWAS in human brain. Using schizophrenia as an example, we found DNA methylation associations to nearly every genome-wide significant variant that clustered into many local differentially methylated regions (DMRs) that explained significant proportions of disease heritability.
Due to the expense and computational intensity, WGBS is challenging for epigenomic studies.
With our large scale study, we were able to identify the effects of technical and potential biological variables on our data. This has been less well characterized than microarray studies, and we found that batch and ancestry cause much variance in the data, and their effects are exacerbated and alleviated, respectively, by the smoothing process. We also found that ENCODE blacklist regions are unreliable in WGBS data, due to the increased difficulty of alignment ( 30 ) .
Previous studies have identified a genomic presence of meQTLs, but not at a single base pair resolution. Our findings are largely consistent with previous work, in that meQTLs are indeed extensive throughout the genome, and that most of their regulation occurs locally. However, while earlier estimates reported that 15% of CpGs were under genetic control ( 11 ) , we greatly increased this fraction to 38%. Like Smith et al. ( 8 ) , we showed that overlap was generally high between the two brain regions we surveyed, though clearly, there are differences as well.
Studies have also found that functional meQTLs are enriched for active chromatin states ( 11 ) and that meQTLs appear to impact alternative splicing ( 12 ) , further agreeing with our results and supporting the idea that schizophrenia risk associated loci are functional meQTLs. With our large sample and high genomic breadth, we are able to expand on all of these earlier findings at an in-depth genomic level.
These results further implicate DNAm as perhaps the most proximal molecular correlate of DNA sequence variation. The most comprehensive eQTL resource constructed in brain tissue, using over 1400 individuals, identified that approximately 25% of common genetic variants associated with nearby gene expression levels ( 50 ) and our meQTL maps here implicated three times as These meQTLs further refined our understanding of the functional significance of schizophrenia genetic risk loci. By leveraging WGBS data combined with genotype data from the same samples, we identified molecular phenotypes associated with individual risk variants. This process could more generally filter GWAS findings to regions of the genome that could impart functional consequences of these risk variants. We found that regions that are differentially methylated by risk-associated genotypes explained most of the heritability imparted by the genome-wide significant schizophrenia risk loci, despite spanning a much smaller fraction of the genome (1.6Mb vs 56.5Mb). We also found that for some of these risk loci which have been previously identified as eQTLs ( 39 ) , DNA methylation acts as a mediator of eQTL effects, refining the potential mechanism by which genetic risk variants may affect brain function. We note that the strongest mediation effects were seen among exons, indicating that differences in methylation may be key to alternative splicing, as has been previously hypothesized ( 51 ) . While our data do not show that DNAm mediates expression for the majority of the meQTLs, this must be viewed with some caution. Our brain samples represent a moment in time in the lifespan of any given brain, and the data are from bulk tissue. At different life stages, perhaps in specific cell populations, mediation effects may be more prominent, particularly in the developing brain ( 28 ) .
WGBS also gives the unique ability to examine CpH methylation, an often overlooked mark, particularly in brain. We found that DNAm levels at specific CpH sites were also associated with genetic variation, which presumably reflected neuron-specific genetic regulation of DNAm levels. It is interesting that the genetic control of CpH methylation seems to have a much larger effect size than that on CpGm, particularly given the fact that the fraction of neurons in our homogenate tissue were uniquely driving these associations. This mark is particularly interesting to examine in relation to psychiatric disorders because it is specific to neurons, so we Overall we have established a comprehensive landscape of genetic control of genomic methylation in the human brain. Based on previous findings that many meQTLs are stable across tissue types, a large fraction of this meQTL map could apply to other tissues and cell types. It is clear that genotype has a robust role in determining local methylation levels, not only at CpG sites but at CpH sites as well. These findings can further be applied to understand the functional significance of genetic risk loci identified in GWAS.

Study Samples
Brain

Data Generation
Genomic DNA was extracted from 100 mg of pulverized dorsolateral prefrontal cortex (DLPFC, corresponding to BA46/9) or hippocampus tissue (dissected as previously described ( 39 )  For PGC analyses, we set the cis distance to 250kb, and considered everything else in trans.
We set the p-value cutoff to 1 so that we had statistics for every SNP-CpG pair in this analysis. Benjamini-Hochberg approach ( 68 ) ).      meQTL statistics compared to t-statistics from all samples. Figure S4 : Lack of association between population MAF differences and meQTL strength.
Though some differences in meQTL strength can be seen between ethnicities, meQTL strength does not associate with SNP MAF or MAF differences between ethnic populations.   Tables   Table S1: Sample Demographic Data