DNA methylation covariation in human whole blood and sperm: implications for studies of intergenerational epigenetic effects

Background
Epidemiological studies suggest that paternal obesity may increase the risk of fathering small for gestational age offspring. Studies in non-human mammals suggest that such associations could be mediated by DNA methylation changes in spermatozoa that influence offspring development in utero. Human obesity is associated with differential DNA methylation in peripheral blood. It is unclear, however, whether this differential DNA methylation is reflected in spermatozoa. We profiled genome-wide DNA methylation using the Illumina MethylationEPIC array in matched human blood and sperm from lean (discovery n = 47; replication n = 21) and obese (n = 22) males to analyse tissue covariation of DNA methylation, and identify whether this covariation is influenced by obesity.

Results
DNA methylation signatures of human blood and spermatozoa are highly discordant, and methylation levels are correlated at only a minority of CpG sites (∼1%). While at the majority of these sites, DNA methylation appears to be influenced by genetic variation, obesity-associated DNA methylation in blood was not generally reflected in spermatozoa, and obesity did not influence covariation patterns. However, one cross-tissue obesity-specific hypermethylated site (cg19357369; chr4:2429884; P = 8.95 × 10⁻⁸; beta = 0.02) was identified, warranting replication and further investigation. When compared to a wide range of human somatic tissue samples (n = 5,917), spermatozoa displayed differential DNA methylation in pathways enriched in transcriptional regulation.

Conclusions
Human sperm displays a unique DNA methylation profile that is highly discordant to, and practically uncorrelated with, that of matched peripheral blood. Obesity only nominally influences sperm DNA methylation, making it an unlikely mediator of intergenerational effects of metabolic traits.


Figure 1. Intergenerational epigenetic inheritance via spermatozoa and overview of study cohorts.
A) Mechanism for how acquired paternal phenotypes could alter offspring physiology via epigenetic alterations to a man's spermatozoa. Epidemiological studies suggest that some acquired paternal traits, including obesity and insulin resistance, are associated with an increased risk of fathering small for gestational age (SGA) offspring [18,19,58]. Studies in non-human mammals suggest that such associations could be mediated by DNA methylation alterations in spermatozoa that induce metabolic reprogramming in the developing foetus [12]. B) Overview of study cohorts. The discovery cohort included 47 lean males (BMI 19-25 kg/m²) and the replication cohorts included 21 lean males (BMI 19-25 kg/m²) and 22 overweight/obese males (BMI >26 kg/m²; 'the obesity cohort'). Age (years) and BMI (kg/m²) are expressed as mean (SD). SGA: small for gestational age. SD: standard deviation.
It will be a long time before studies of DNA methylation in human spermatozoa reach a comparable magnitude to those currently available on peripheral blood. Therefore, it is of interest to identify CpG sites where DNA methylation levels covary between the two tissues, that is, sites at which blood methylation is predictive of sperm methylation, even if the absolute level of methylation is different. The extent to which these sites overlap with those identified in blood as associated with environmental stimuli or acquired phenotypes will provide new insight into whether the sperm methylome may be similarly responsive. At such CpG sites, using blood DNA methylation as a proxy for inferring DNA methylation in spermatozoa might be justified. To our knowledge, the largest study that analysed genome-wide DNA methylation in an unbiased manner in matched samples of blood and sperm to date included a total of eight participants [20].
In this study, we analysed genome-wide DNA methylation using the Infinium MethylationEPIC array in matched samples of human blood and sperm from lean (n = 68) and overweight/obese (n = 22; 'the obesity cohort') healthy males of proven fertility. We interrogated the extent to which obesity-associated DNA methylation in blood is reflected in spermatozoa from obese males and identified obesity-associated CpG sites in sperm and blood. Spermatozoal DNA methylation data were further compared to those of nearly 6,000 somatic tissue samples available on the Gene Expression Omnibus data repository [21], allowing us to identify sperm-specific DNA methylation signatures. Together, our analyses interrogate the plausibility of spermatozoal DNA methylation as a mechanism for intergenerational effects of paternal obesity and whether whole blood can be used as a surrogate tissue for analyses of DNA methylation when sperm is unavailable. Further, they provide a unique insight into how spermatozoal DNA methylation compares to DNA methylation in a wide range of human somatic tissues.

General characterisation of the sperm DNA methylome
We used the Illumina MethylationEPIC array to quantify DNA methylation at > 850,000 CpG sites across the human genome in matched samples of whole blood and sperm from a discovery cohort of 47 lean, healthy males of proven fertility. Following pre-processing, normalization and stringent quality control (see Materials and Methods), a total of 704,356 probes were retained for further analyses. Raw and pre-processed DNA methylation data is available for download from the Gene Expression Omnibus  Table S1). Gene bodies in spermatozoa displayed overall high levels of DNA methylation, whilst sparser DNA methylation was seen around transcription start sites (TSS) and 5' untranslated regions (UTRs), as well as the first exons (Figure 2B, Table S2).
In line with previous reports, we confirmed that the DNA methylation age estimator developed by Horvath [4] worked well in whole blood (r = 0.74, P = 2.55 × 10⁻⁹, Pearson's product moment correlation), but not in sperm (r = 0.26, P = 0.07, Figure S1A). This is likely because the Horvath DNA methylation age estimator was developed using only 45 semen samples out of a total of 7,844 samples (0.6%) from different tissues, including 4,180 blood-derived samples (53%) [4]. However, age could be predicted more accurately using the model recently developed by Jenkins and colleagues [22], which was specifically trained on sperm samples (r = 0.68, P = 1.78 × 10⁻⁷, Figure S1B).

DNA methylation in imprinted regions
Genomic imprinting refers to the phenomenon that genes are epigenetically regulated to be expressed in a parent-of-origin specific manner [23]. In spermatozoa, imprinted genes should be either completely unmethylated or fully methylated depending on the gene [23]. Conversely, in blood, the parent-of-origin driven allele-specific methylation should result in methylation values of around 50% for any given imprinted site. DNA methylation levels at CpG sites annotated to genes listed in the Geneimprint database (http://www.geneimprint.com/site/genes-by-species) were compared between spermatozoa and whole blood (Figure S2). In the case of CpG sites annotated to genes that are known to be imprinted, we observed an enrichment of sites with median methylation around 0.5 in whole blood, particularly for paternally imprinted genes (21% of sites with median beta between 0.4 and 0.6 vs 3% of sites across the array-wide background; P < 1.00 × 10⁻⁵⁰, Fisher's exact test), but also for maternally imprinted genes (11% of sites; P = 9.19 × 10⁻⁹). For genes predicted to be imprinted according to the Geneimprint database, there was a less pronounced enrichment (paternal: 6% of sites; P = 0.01; maternal: 6% of sites; P = 0.04). No such enrichment was observed for spermatozoal DNA methylation in any of the four categories (P > 0.05). Because gene annotation on the methylation array is based only on proximity, this approach includes many CpG sites not actually located in imprinting control regions (ICRs). Therefore, we also compared DNA methylation distributions at sites which specifically fall into known human ICRs as reported by WAMIDEX (https://atlas.genetics.kcl.ac.uk). This second approach further confirmed an enrichment of probes with around 50% methylation located in ICRs in blood compared to sperm (Figure S3). Strikingly, of the 169 CpG sites that fell into ICRs, the majority showed median beta values around 0.5 in blood (57% of sites with beta between 0.4 and 0.6, P < 1.00 × 10⁻⁵⁰, Fisher's exact test vs array-wide background). On the other hand, nearly all of the 169 sites were completely unmethylated in sperm (94% with median beta < 0.2, P < 1.00 × 10⁻⁵⁰).
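The enrichment tests reported above compare the proportion of intermediately methylated sites in a gene set with the proportion in a background set. A minimal sketch of such a test is given below, assuming a one-sided Fisher's exact test and a background that excludes the set itself; neither choice is stated in the text, and the function name and toy counts are hypothetical rather than taken from the study.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_enrichment_p(hits_in_set, set_size, hits_in_background, background_size):
    """One-sided Fisher's exact test: P(X >= hits_in_set) under the
    hypergeometric null, where X counts hits drawn into the set."""
    total = set_size + background_size
    total_hits = hits_in_set + hits_in_background
    log_denominator = log_comb(total, set_size)
    p_value = 0.0
    for k in range(hits_in_set, min(set_size, total_hits) + 1):
        # k hits and (set_size - k) non-hits fall inside the set
        log_p = (log_comb(total_hits, k)
                 + log_comb(total - total_hits, set_size - k)
                 - log_denominator)
        p_value += math.exp(log_p)
    return min(p_value, 1.0)

# Hypothetical toy table (not study data): 8 of 10 set sites and
# 2 of 10 background sites have a median beta between 0.4 and 0.6.
toy_p = fisher_enrichment_p(8, 10, 2, 10)
```

Working on the log scale keeps the calculation stable for array-sized backgrounds, where the binomial coefficients themselves would be astronomically large.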

The sperm DNA methylome exhibits a more polarised genome-wide DNA methylation profile than blood
We compared the overall distribution of DNA methylation levels across the blood and sperm genomes. Sperm displayed a more polarised methylation profile compared to blood, i.e. both low and high median levels of methylation were more commonly seen in sperm (Figure 3A), with 33% of sites showing median beta < 0.2 in sperm vs 27% in blood and 49% of sites with median beta > 0.8 in sperm vs 35% in blood. Principal component (PC) analysis was performed across the full discovery dataset comprising the 704,356 probes that remained after filtering. The first PC, explaining 51.41% of the variance, clearly distinguished between sperm and blood, indicating that the tissue of origin was the primary determinant of differences in DNA methylation profiles (Figure S4). At the majority of interrogated sites, DNA methylation levels differed significantly between sperm and blood (n = 447,846 sites (64%), P < 9 × 10⁻⁸, paired t-test; Table S3). At 62% of these sites (n = 277,831 sites), sperm was relatively hypermethylated compared to blood.
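The polarisation summary above (the share of sites with median beta < 0.2 or > 0.8) is simple to compute per tissue. The sketch below is illustrative only: the function name and the toy beta values are hypothetical, and the thresholds are the ones quoted in the text.

```python
from statistics import median

def polarisation_summary(betas_by_probe, low=0.2, high=0.8):
    """Fraction of probes whose median beta is below `low` or above `high`."""
    medians = [median(values) for values in betas_by_probe.values()]
    n_probes = len(medians)
    return {
        "low": sum(m < low for m in medians) / n_probes,
        "high": sum(m > high for m in medians) / n_probes,
    }

# Hypothetical toy data: four probes, three donors each (not study data).
toy_sperm = {
    "probe_a": [0.02, 0.05, 0.03],
    "probe_b": [0.91, 0.95, 0.93],
    "probe_c": [0.88, 0.90, 0.97],
    "probe_d": [0.45, 0.55, 0.50],
}
toy_summary = polarisation_summary(toy_sperm)
```

Running the same function on the blood matrix and comparing the two dictionaries gives the kind of contrast summarised in Figure 3A.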
A more detailed characterisation of the differences between the sperm and blood DNA methylomes was performed by comparing DNA methylation levels in sperm and blood across different genomic regions (Figure 3B-C, Tables S5-S6). CpG islands and CpG island shores were found to be less methylated in sperm compared to blood (0.07 and 0.16 lower in sperm respectively, P < 1.0 × 10⁻⁵⁰ for both, paired t-test). CpG island shelves and CpG sites in open seas were relatively hypermethylated in sperm compared to blood (0.06 and 0.07 higher in sperm respectively, P < 1.0 × 10⁻⁵⁰ for both) (Figure 3B, Table S5). Regions upstream of transcriptional start sites were relatively hypomethylated in sperm compared to blood (0.02 lower at TSS200 and 0.11 lower at TSS1500, P < 1.0 × 10⁻⁵⁰ for both), as were sites mapping to the 3'UTR (0.01 lower, P = 3.81 × 10⁻⁵) or first exon (0.01 lower, P < 1.0 × 10⁻⁵⁰). Conversely, other transcribed regions were hypermethylated in sperm compared to blood, including gene bodies (0.02 higher, P < 1.0 × 10⁻⁵⁰), 5'UTRs (0.01 higher, P = 1.3.61 × 10⁻³²), and exon boundaries (0.02 higher, P = 2.80 × 10⁻²²; Figure 3C, Table S6). We replicated these differences in the lean replication (n = 21 lean males) and obesity cohort (n = 22 obese males) (Supplementary Material: Replication, Figure S5, Table S3).

Figure 3. A) Array-wide comparison of CpG methylation in sperm and blood, showing that both low (< 20%) and high (> 80%) DNA methylation levels are more commonly seen in sperm. Plotted is the distribution of median DNA methylation levels across all individuals in the discovery cohort. B) The percentage of CpG sites that are relatively hyper- and hypomethylated in sperm compared to blood, and CpG sites where there is no significant difference in DNA methylation between the tissues, are shown according to CpG region. C) The percentage of CpG sites that are relatively hyper- and hypomethylated in sperm compared to blood, and CpG sites where there is no significant difference in DNA methylation between the tissues, are shown according to genomic region. TSS: transcription start site, UTR: untranslated region.

Sperm has a unique DNA methylation profile enriched in pathways relating to transcriptional regulation
The Gene Expression Omnibus (GEO) is a publicly available data repository that contains DNA methylation data from a range of human tissue samples, most of which have been analysed using the Illumina Infinium HumanMethylation450 BeadChip (450K array) [21]. In order to investigate how the DNA methylation profile of spermatozoa compares to that of somatic tissues, DNA methylation data from 371 sperm samples (90 from our discovery, replication and obesity cohorts combined and 281 samples from GEO) were compared to those of 5,917 somatic tissue samples from male donors available on GEO (see Table S7 and Table S8 for details on tissue samples). Restricting analysis to CpG sites covered by both the EPIC and 450K arrays (n = 452,626 sites), we used linear regression to identify sperm-specific DNA methylation signals across the 6,288 samples. After Bonferroni correction, a total of 133,125 genome-wide significant CpG sites (29%) were identified as differentially methylated between sperm and somatic tissues (Table S9). At 18% of these sites, sperm was characterized by higher methylation levels than somatic tissues, whereas at the remaining 82% (n = 109,290 sites) sperm was relatively hypomethylated. This is in contrast to the paired analysis with blood and likely due to the nearly exclusive coverage of CpG islands on the 450K array. Gene Ontology (GO) enrichment analysis [24] revealed 272 GO terms amongst hypermethylated CpG sites (Table S10). The main two categories of enriched pathways related to regulation of gene transcription (37 pathways) and neurological traits and functions (67 pathways). The latter is possibly driven by the relatively large proportion of brain and neuronal samples amongst the somatic tissues (16%). Of the 37 GO terms enriched amongst hypomethylated CpG sites, 8 (22%) related to sensory perception, particularly smell (Table S11). We repeated the same analysis after removing unsorted tissues, tumours and cell lines (1,046 samples) and obtained virtually the same results.

Covariation of DNA methylation between sperm and blood is limited and most likely explained by genetic variation
We next explored whether, despite the blood and sperm DNA methylomes being highly distinct, there were CpG sites where the levels of DNA methylation covaried between the tissues. We used minimum variability criteria for sites to be tested to avoid correlations driven by individual outliers, similar to those used by Hannon and colleagues [15]: we selected sites for which the middle 80% of samples had a beta range ≥ 0.05 in both blood and sperm. This restricted our analyses to 155,269 variable sites. At 1,513 of these (~1%), DNA methylation levels were significantly correlated between the two tissues (P < 9 × 10⁻⁸, Pearson's product moment correlation; Figure 4A, Table S12).
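The variability filter and per-site correlation described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the "middle 80%" is taken to be the span between the 10th and 90th percentiles with linear interpolation, the function names and toy values are hypothetical, and the significance test against P < 9 × 10⁻⁸ is omitted.

```python
import math

def middle_80_range(values):
    """Beta range spanned by the middle 80% of samples
    (assumed here to be the 10th to 90th percentile)."""
    ordered = sorted(values)
    n = len(ordered)

    def percentile(p):
        # Linear interpolation between the two closest ranks.
        pos = p * (n - 1)
        lower = math.floor(pos)
        upper = math.ceil(pos)
        return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

    return percentile(0.9) - percentile(0.1)

def pearson_r(x, y):
    """Pearson's product moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def covariation_by_probe(blood, sperm, min_range=0.05):
    """Correlate blood and sperm betas at probes that are variable in BOTH
    tissues. Inputs map probe ID -> betas listed in the same donor order."""
    results = {}
    for probe in blood:
        if (middle_80_range(blood[probe]) >= min_range
                and middle_80_range(sperm[probe]) >= min_range):
            results[probe] = pearson_r(blood[probe], sperm[probe])
    return results

# Hypothetical toy data for ten matched donors (not study data).
toy_blood = {
    "cg_var": [0.1 * i for i in range(1, 11)],
    "cg_flat": [0.50, 0.51] * 5,   # too little variability in blood
}
toy_sperm = {
    "cg_var": [0.5 * b + 0.1 for b in toy_blood["cg_var"]],
    "cg_flat": [0.2, 0.9] * 5,
}
toy_result = covariation_by_probe(toy_blood, toy_sperm)
```

Requiring variability in both tissues means a probe such as `cg_flat`, which varies in sperm but not in blood, is never tested.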
Given the observation of several bi- and trimodal patterns of DNA methylation amongst highly correlated sites (Figure 4B), we applied a combination of outlier analysis and k-means clustering with manual verification to identify which of the 1,513 significantly correlated CpG sites exhibit these patterns. The majority of correlated CpG sites (1,140 sites, 75%) showed a bimodal distribution and 205 sites (14%) showed a trimodal distribution of DNA methylation, both of which are suggestive of a strong genetic influence on DNA methylation or on its measurement. Probes with the highest correlation coefficients tended to show clear trimodal patterns (Figure 4B), while a third of bimodally distributed probes (365) appear to be driven by single outliers (Figure S6). A subset of correlated sites (30, i.e. 2%) displayed a negative correlation between DNA methylation in sperm and blood (Figure 4C), and at a small number of sites distinct trimodal methylation patterns are present in only one of the two tissues (Figure 4D).
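To illustrate how k-means clustering can expose a genotype-like trimodal pattern at a single CpG site, a minimal one-dimensional sketch follows. It is not the procedure used in the study, which also involved outlier analysis and manual verification; the starting centres (minimum, midpoint, maximum), the function name and the toy betas are all assumptions made for this example.

```python
def kmeans_1d(values, initial_centres, max_iter=100):
    """Plain 1-D k-means. Returns (centres, labels), where labels[i] is the
    index of the centre assigned to values[i]."""
    centres = list(initial_centres)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each beta value joins its nearest centre.
        labels = [min(range(len(centres)), key=lambda i: abs(v - centres[i]))
                  for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for i, old_centre in enumerate(centres):
            members = [v for v, label in zip(values, labels) if label == i]
            new_centres.append(sum(members) / len(members) if members else old_centre)
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels

# Hypothetical betas at one probe for nine donors (not study data),
# resembling the three genotype groups expected for a SNP in the CpG site.
toy_betas = [0.05, 0.06, 0.04, 0.50, 0.52, 0.48, 0.95, 0.96, 0.94]
start = [min(toy_betas), (min(toy_betas) + max(toy_betas)) / 2, max(toy_betas)]
toy_centres, toy_labels = kmeans_1d(toy_betas, start)
cluster_sizes = [toy_labels.count(i) for i in range(3)]
```

Well-separated, tight clusters in both tissues are what would flag a probe as likely genetically driven; a cluster containing a single donor would instead point to an outlier.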
We cross-checked all correlated sites for known SNPs in the probe sequence using the dbSNP Human Build 151 database [25]. Nearly all probes (1,507; > 99%) were found to have known SNPs in the probe sequence, > 90% of which are in the CpG site itself (Figure 5). This would indicate that DNA methylation readouts at these sites are most likely measuring genetic variation rather than epigenetic state. Only a small subset (n = 6) of the CpG sites that were significantly correlated had no known SNPs in their probe sequence. Some of these nevertheless displayed bi- and trimodal patterns of DNA methylation suggestive of a genetically driven effect and could potentially constitute strong mQTLs (Figure 4E).

Figure 4. Covariation of DNA methylation between blood and sperm.
A) The observed correlation of DNA methylation levels in sperm and blood (histogram) is plotted against the estimated null distribution (red density curve). A small percentage of sites display highly correlated DNA methylation levels (r > 0.8), and the observed distribution is overall slightly shifted to the right compared to the null distribution. B) cg02024240 (chr5:159669974) shows a strong DNA methylation correlation between blood and sperm and a trimodal methylation pattern suggestive of a genetically driven effect (r > 0.99, P = 4.68 × 10⁻⁴⁸). C) cg25317025 (chr18:47019823) is one of 30 sites showing a negative correlation between blood and sperm (r = -0.89, P = 5.14 × 10⁻¹⁷). D) Some probes display striking differences in variability between the two tissues: cg20673407 (chr10:31040939) is characterized by a distinct trimodal pattern in whole blood while showing less overall variability in sperm (r = 0.82, P = 1.45 × 10⁻¹²). E) Only 6 of the significantly correlated probes have no known SNPs anywhere in the probe sequence. cg02486009 (chr15:22428395) is one of these (r = 0.96, P = 1.90 × 10⁻²⁷). Nonetheless, it shows a bimodal DNA methylation pattern in both tissues, suggestive of a genetically driven effect.

Secondly, we overlapped our correlated CpG sites with a list of recently reported correlated regions of systemic interindividual variation (CorSIV) in DNA methylation [26]. Only 0.2% of non-correlated variable probes are contained in CorSIVs, in line with the low overall genomic prevalence of these regions (0.1% of the human genome). Strikingly, we observe a roughly 10-fold enrichment within the correlated sites (2.2%, P = 8.85 × 10⁻²⁵, Fisher's exact test). The observations from the sperm data suggest that for sites exhibiting bi- and trimodal methylation patterns there is a likely genetic origin (either a SNP in the CpG site or strong methylation QTL effects). Therefore, this enrichment conflicts with the hypothesis that for at least these sites, the origin of cross-tissue covariation is developmentally established stable epialleles [27]. Finally, using cis DNA methylation QTL data from whole blood published by McClay and colleagues [28], we found that 232 (30%) of the correlated sites also present on the 450K array had previously been identified as mQTLs in whole blood, representing a significant enrichment over the 16% observed across all variable probes (P = 1.66 × 10⁻³³, Fisher's exact test).
Correlations largely replicated in the two replication cohorts (Supplementary Materials: Replication, Table S12), and non-replicating sites were generally driven by outliers in the discovery cohort (examples shown in Figure S7).

Limited evidence for converging associations between DNA methylation and obesity from whole blood and sperm
We next investigated whether obesity was associated with DNA methylation in sperm or blood. At the 697,384 sites that passed quality control in the combined replication cohort, including lean and obese males, we used linear regression of DNA methylation on obesity status, controlling for estimated blood cell types in the blood dataset. No probes passed array-wide significance (P < 9 × 10⁻⁸) in blood or sperm (Table S13). Given our small sample size, we leveraged published data from a larger EWAS of BMI in whole blood ([1]; see Materials and Methods). First, we tested whether the 187 replicated array-wide significant probes (P < 1.0 × 10⁻⁷) reported by Wahl and colleagues, which were also present in our data, were enriched in lower-ranked P values in our data, and secondly, we compared effect sizes at these 187 probes between our cohort and the published data. To make both analyses comparable, we treated BMI as a continuous measure for these comparisons, as Wahl and colleagues had done in the original epigenome-wide association study. Both analyses confirmed enrichments of the reported associations in blood but not sperm: lower-ranked P values were enriched in blood (P < 1.3 × 10⁻²³, Wilcoxon rank sum test) but not sperm (P = 0.06, Figure 6A) and similarly, the reported effects at the 187 probes were correlated significantly with effects observed in our blood data (r = 0.72, P < 1.0 × 10⁻⁵⁰, Spearman's rank correlation, Figure 6B) but not in sperm (r = 0.13, P = 0.11, Figure 6C). This indicates that the associations identified by Wahl and colleagues do not generalize to sperm. Finally, to maximise power within our own sample, we ran a linear mixed effects model across the discovery and replication datasets, using the 692,265 probes that survived quality control in both datasets. DNA methylation was regressed onto tissue (blood versus sperm), age, batch and obesity status, while controlling for interindividual variation with a random effect (Table S13). This analysis found that methylation at one CpG site, cg19357369 (chr4:2429884), was significantly increased in obese men in sperm and blood (beta = 0.02, P = 8.95 × 10⁻⁸, Figure 6D).

Obesity does not significantly influence the covariation of DNA methylation between sperm and blood
To investigate whether the covariation of DNA methylation was significantly altered in obesity, we ran an interaction model that regressed DNA methylation in blood onto DNA methylation in sperm, obesity status and their interaction effect, while adjusting for experimental batch and age (see Materials and Methods). We identified 98 CpG sites with a statistically significant interaction between obesity and the association of blood and sperm DNA methylation (P < 9 × 10⁻⁸). Interactions at the vast majority of these CpG sites (96) were driven by individual outliers in the obese cohort (Figure S8A-C); the remaining two sites appear to be driven by outliers in the lean cohort and a batch effect (Figure S8D). We therefore conclude that we were not able to identify credible altered DNA methylation covariation patterns between blood and sperm that may have arisen as part of a gene-environment interaction.

Figure 6. Of the replicated array-wide significant probes reported by Wahl and colleagues [1], 187 were also present in our replication cohort of lean and obese men. We regressed BMI onto DNA methylation in each tissue, controlling for estimated blood cell types in the blood analysis to match the analysis used by Wahl and colleagues. A) Lower-ranked P values were found to be enriched amongst these 187 sites in blood (P < 1.3 × 10⁻²³, Fisher's exact test) but not sperm (P = 0.06). B) Effect sizes at the 187 probes were significantly correlated between our blood data and the summary statistics published by Wahl and colleagues (r = 0.72, P < 1.0 × 10⁻⁵⁰, Spearman's rank correlation). C) No such correlation was observed for our sperm data (r = 0.13, P = 0.11). D) In a linear mixed effects model across the discovery and replication datasets, DNA methylation was regressed onto tissue (blood versus sperm), age, batch and obesity status, while controlling for interindividual variation. This analysis identified significant hypermethylation at one CpG site, cg19357369 (chr4:2429884), in obese compared to lean men across the two tissues (beta difference = 0.02, P = 8.95 × 10⁻⁸).
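The interaction term in the covariation model described in this section has a simple reading: it is the difference between the blood-on-sperm regression slope in obese men and the slope in lean men. The sketch below illustrates this for one CpG site. It is a simplification that assumes no covariates (batch and age are omitted) and performs no significance test, and the function names and toy values are hypothetical.

```python
def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def interaction_estimate(sperm, blood, obese):
    """Difference in blood-on-sperm slopes (obese minus lean). Without other
    covariates this equals the interaction coefficient of the model
    blood ~ sperm + obese + sperm:obese."""
    lean = [(s, b) for s, b, o in zip(sperm, blood, obese) if not o]
    obese_group = [(s, b) for s, b, o in zip(sperm, blood, obese) if o]
    slope_lean = ols_slope(*zip(*lean))
    slope_obese = ols_slope(*zip(*obese_group))
    return slope_obese - slope_lean

# Hypothetical toy data for one probe (not study data):
# lean men follow blood = 0.5 * sperm + 0.1, obese men follow blood = sperm.
toy_sperm = [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4]
toy_obese = [False] * 4 + [True] * 4
toy_blood = [s if o else 0.5 * s + 0.1 for s, o in zip(toy_sperm, toy_obese)]
toy_interaction = interaction_estimate(toy_sperm, toy_blood, toy_obese)
```

Because each slope is estimated from one group only, a single extreme donor in a small group can dominate the difference, which is consistent with most of the 98 significant interactions being attributed to individual outliers.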

Discussion
In this study, we characterized the sperm methylome in relation to blood and other somatic tissues, investigated covariation between DNA methylation in sperm and whole blood and analyzed DNA methylation patterns associated with obesity. We conclude that the DNA methylation profiles of sperm and blood are highly distinct, and that there is little evidence of DNA methylation covariation between the two tissues, beyond genetic and technical effects.
In line with previous, smaller-scale studies, we showed that the sperm DNA methylome is highly polarised compared to that of blood, with both low (beta < 0.2) and high (beta > 0.8) levels of DNA methylation more frequently observed in sperm than in blood [20]. In contrast to previous research, however, we found that the sperm DNA methylome is overall slightly hypermethylated compared to that of blood [20,29,30]. This finding is potentially influenced by the fact that the previous generation of DNA methylation array (the 450K array) included a higher proportion of CpG islands, which are relatively hypomethylated in spermatozoa [20,31].
We identified significant differences in DNA methylation levels at the majority of assayed CpG sites when comparing whole blood to sperm. Additionally, in our comparison of the spermatozoal DNA methylome to that of almost 6,000 somatic tissue samples, we showed that gene ontology terms enriched amongst hypermethylated CpG sites in sperm pointed repeatedly to transcriptional regulation.
This is an intriguing finding considering that recent research has shown that high overall levels of transcription during spermatogenesis facilitate transcription-coupled DNA repair mechanisms through so-called "transcriptional scanning" [32]. Given that transcriptional regulation is an essential process for all cell-types, it is striking to observe sperm-specific DNA methylation patterns enriched in these processes. It could suggest that DNA methylation is involved in widespread transcriptional downregulation as cells progress from an active transcriptional stage during spermatogenesis to a more transcriptionally repressed stage in mature sperm.
About 1% of variable sites in whole blood and sperm showed a significant correlation of DNA methylation between whole blood and sperm. This is slightly lower than what has been reported for comparisons of DNA methylation between whole brain and peripheral tissues [33]. Furthermore, at the vast majority of correlated CpG sites, the correlation appeared to be driven by underlying genetic variation resulting in characteristic bi- and trimodally clustered distributions of DNA methylation. In most of these cases, known SNPs were identified in the CpG site itself or in the single base extension. This finding is further supported by the observed enrichment of mQTLs [28] and CorSIVs [26] amongst correlated sites. Thus, whilst we lack specific genotyping information on individual participants in this study, our findings strongly suggest genetic variation as the underlying cause of DNA methylation covariation between blood and sperm. This is despite the fact that we employed stringent filtering of probes in close proximity to SNPs from previously published lists [31,34,35], which suggests a need to update existing reference lists.
We also identified a small number of CpG sites where DNA methylation was negatively correlated between blood and sperm, and sites where DNA methylation exhibited a trimodal distribution pattern in one tissue only. It would be of interest to investigate further whether pathophysiological traits are associated with an increase in DNA methylation in one tissue and a decrease in the other, and in particular whether germ cell or leukocyte-specific transcription factors are responsible for the discordant yet correlated DNA methylation distribution patterns across blood and sperm.
The small number of sites (6 out of 1,513) where no obvious genetic driver of methylation variability was identified are likely too few to be of value in studies where blood is needed as a surrogate tissue for sperm. The results of this study are generally in line with similar studies of DNA methylation covariation, such as between whole blood and various brain regions [15], albeit more extreme.
This study identified one CpG site, cg19357369, as hypermethylated in sperm and blood from obese versus lean males. The finding should be interpreted with caution, as it requires replication and only just passed the array-wide multiple testing threshold, which was not further corrected for the several analyses of sperm DNA methylation performed across the study (comparison with blood, correlation with blood, interaction, single-tissue EWAS, multi-tissue EWAS). The effect size was also comparatively small (beta = 0.02). cg19357369 is found upstream of the lncRNA RP11-503N18, which has yet to be characterised in terms of biological function [36]. However, previous research has shown that DNA methylation at cg19357369 is significantly altered during human fetal brain development [37]. Although cg19357369 has previously been identified as differentially methylated in hepatic tissue from obese compared to lean males [36], it has not previously been identified in EWASs of obesity or BMI when only blood samples have been analysed. If shown to be replicable, it could point towards the possibility of an obesity-associated signature of spermatozoa.
Overall, we found that differentially methylated CpG sites associated with BMI in a large-scale EWAS in blood were not evident in sperm. Therefore, our current understanding of epigenetic associations of weight-associated phenotypes, which stems almost exclusively from studies of whole blood, is unlikely to give us functional insights into how these may be passed to offspring.
There are limitations to our study. First, it constitutes an observational, cross-sectional study and we are therefore unable to comment on the causality behind observed associations between obesity and spermatozoal DNA methylation. The limited sample size of the obesity cohort (n = 22) reduced our ability to detect modest effects of obesity on DNA methylation covariation between sperm and whole blood. The obesity cohort included a proportion of overweight males (BMI 25-30 kg/m²), which potentially diluted our results. Further, while we used the most comprehensive DNA methylation array currently available, the MethylationEPIC array is still biased towards certain parts of the genome (most notably enhancer regions, RefSeq genes and CpG islands) and does not give a complete picture of genome-wide CpG methylation [38]. Lastly, although we were able to speculate as to the effects of genetic variants in CpG sites influencing our results, given trimodal methylation patterns and the presence of known SNPs in the CpG site, we did not have the actual genetic sequence of our subjects to verify this directly.
The study has several strengths. It constitutes the largest unbiased analysis of DNA methylation in matched human sperm and blood samples performed to date, and is one of the largest studies of spermatozoal DNA methylation in healthy males of proven fertility. In contrast to several previous analyses of DNA methylation in human spermatozoa [39][40][41], our study includes a replication cohort, increasing the robustness of our findings. Crucially, our analyses include the use of large existing datasets; blood-sperm correlated CpG sites were interrogated for overlap with previously identified mQTLs in whole blood [28] as well as with a list of recently reported CorSIVs [26]. We used findings from one of the largest studies of obesity-associated DNA methylation in blood performed to date [1] to analyse whether effects of obesity observed in blood overlapped with those observed in sperm. Lastly, we used recently developed DNA methylation analysis pipelines for large DNA methylation datasets [42] to identify sperm-specific DNA methylation signatures by comparing spermatozoal DNA methylation data to that of almost 6,000 somatic tissue samples available on GEO [21]. Together, these analyses allowed us to interrogate the spermatozoal DNA methylome in novel ways and provide highly suggestive evidence for why DNA methylation as a mechanism for intergenerational effects of obesity in humans is unlikely.

Conclusions
Our data suggest that, compared with a wide range of somatic tissues, human sperm displays a unique DNA methylation profile, particularly in pathways relating to transcriptional regulation. We show that DNA methylation levels in human blood and sperm are correlated at only a minority of CpG sites and that, at such sites, DNA methylation covariation is most likely due to genetic effects. The use of peripheral blood as a surrogate tissue for human spermatozoa is therefore inadvisable. Obesity does not generally influence spermatozoal DNA methylation, nor the covariation of DNA methylation between blood and sperm. Further, obesity-associated CpG sites identified in peripheral blood do not show enrichment in spermatozoa from obese individuals. Taken together, our findings suggest that if there are inter- and transgenerational effects of human obesity, they are unlikely to be mediated by changes in spermatozoal DNA methylation.

Samples
Whole blood and semen samples were collected from participants recruited from University College London Hospital (UCLH) between May 2016 and March 2019. Participants were phenotyped with regard to BMI, waist circumference, systolic and diastolic blood pressure, blood lipids, fasting insulin and glucose levels and C-reactive protein (CRP). Phenotypic information about participants is detailed in Table S4.
Participants provided information about their medical history and lifestyle via questionnaires, and were excluded if they suffered from significant medical conditions or took regular medications. All participants were of proven fertility. Peripheral blood samples were centrifuged at 3000g for 15 minutes within one hour of venepuncture and the buffy coat was used for DNA extraction.
Semen samples were processed within one hour of sample production as per UCLH protocol and analysed for sperm concentration, motility and average progressive velocity using the Sperminator/Computer Assisted Sperm Analysis system (Pro-Creative Diagnostics, Staffordshire, UK).
Semen sample parameters are detailed in Table S14. All semen samples were within normal parameters according to World Health Organization criteria [43]. Samples underwent gradient centrifugation (45 and 90% PureSperm medium; PureSperm 100®, Nidacon Laboratories, PS100-100) to select for motile spermatozoa as described elsewhere [44]. The processed samples were microscopically assessed for cell purity such that only samples with no visible cells other than spermatozoa were included in downstream analyses.

DNA extraction
DNA from 200 µL buffy coat derived from whole blood was extracted using the Qiagen QIAamp DNA Blood Mini Kit (Qiagen, Cat No. 51104) according to the manufacturer's instructions [45]. DNA from the pellet of motile spermatozoa was extracted using a standard phenol-chloroform extraction method as described previously [46]. DNA extracted from whole blood and sperm was quality controlled using a Qubit 3.0. Samples were assigned a unique code for identification, randomized with regard to cohort and other variables to avoid batch effects, and processed in two batches. The Illumina Genome Studio software was used to extract the raw signal intensities of each probe (without background correction or normalization). Raw DNA methylation data is available for download from GEO (accession number GSE102538).

Data pre-processing
Data analysis was performed in R version 3.6.2. DNA methylation data was processed and analysed using the wateRmelon package in R [48]. An initial outlier analysis was performed using the outlyx() function in wateRmelon based on 1) the interquartile range of the first principal component and 2) the pcout algorithm [50] for detecting outliers in high-dimensional datasets, leading to the removal of 1 individual from the discovery cohort, 2 individuals from the obesity cohort and 3 individuals from the lean replication cohort. The 59 non-CpG SNP probes on the array were used to confirm that the genotypes at these 59 probes were identical for the matched samples.
Prior to data analysis, 9,779 probes were removed from the discovery data because more than 5% of samples displayed a detection P value > 0.05. Furthermore, 3,337 probes were removed because they had a bead count < 3. Probes containing SNPs in close proximity to the CpG site (within 10 base pairs) as well as potentially cross-reactive probes were filtered using annotated lists from three sources [31,34,35], leading to the removal of 149,105 CpG sites. The final discovery data set comprised 704,356 CpG sites. Data was normalized in the R package wateRmelon using the dasen() function as previously described [48]. The lean and obese replication cohorts were processed together experimentally and were therefore jointly pre-processed and normalised using the same parameters as for the discovery dataset. A total of 697,442 probes survived quality control and filtering in the replication data. DNA methylation was analysed and reported as beta values, which are the ratio of methylated probe intensity to the overall intensity and approximately equal to the percentage of methylated sites (% DNA methylation). For plotting purposes, beta values are shown and described as percent DNA methylation.
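The beta-value definition and the detection P value filter described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used wateRmelon in R. The function names, the data layout and the optional `offset` argument are assumptions made for illustration; array software often adds a small constant to the denominator, so the default of 0 simply matches the ratio as stated in the text.

```python
def beta_value(methylated, unmethylated, offset=0.0):
    """Beta = M / (M + U + offset).

    `offset` is an assumed stabilising constant; 0 reproduces the plain
    ratio of methylated intensity to overall intensity.
    """
    total = methylated + unmethylated + offset
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return methylated / total


def failed_probes(detection_p, p_cutoff=0.05, max_fail_fraction=0.05):
    """Return IDs of probes where more than `max_fail_fraction` of samples
    have a detection P value above `p_cutoff`.

    `detection_p` maps a probe ID to a list of per-sample detection P values.
    """
    failed = []
    for probe, pvals in detection_p.items():
        n_fail = sum(1 for p in pvals if p > p_cutoff)
        if n_fail / len(pvals) > max_fail_fraction:
            failed.append(probe)
    return failed


# Toy example with hypothetical probe IDs and 20 samples per probe.
toy = {
    "probe_a": [0.001] * 20,               # passes in every sample
    "probe_b": [0.001] * 18 + [0.2, 0.3],  # fails in 10% of samples
}
to_remove = failed_probes(toy)
```

A beta value of 0.75 computed this way corresponds to roughly 75% DNA methylation at that site, which is the scale used in the figures.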

Characterization of DNA methylation in sperm
CpG sites were assigned to chromosomes, locations, genes, and genomic regions using the Illumina

DNA methylation age estimates
DNA methylation age was estimated on the discovery sample from both blood and sperm DNA methylation using Horvath's DNA methylation age estimator [4]. We additionally estimated DNA methylation age from sperm using the method described by Jenkins and colleagues [22].

Annotation of imprinted genes/ imprinting control regions
CpG sites were annotated to imprinted genes using the Illumina manifest for the EPIC array and the list of imprinted genes published in the Geneimprint database (http://www.geneimprint.com/site/genes-byspecies

DNA methylation differences between blood and sperm
Sites characterized by differences in DNA methylation between whole blood and sperm were identified by a paired t-test of matched samples. Comparison of the difference in DNA methylation levels between sperm and blood at different genomic regions was performed by calculating a paired t-test of median DNA methylation in sperm vs blood across all sites annotated to a specific genomic or CpG region.
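As an illustration of the paired comparison at a single CpG site, the following Python sketch computes the mean within-individual difference and the paired t statistic. It is a simplified stand-in for the R analysis, the function name is hypothetical, and it deliberately stops at the test statistic because the Python standard library has no t distribution from which to derive a P value. Differences are taken as sperm minus blood, so positive values indicate higher methylation in sperm.

```python
import math
import statistics


def paired_t(blood, sperm):
    """Paired t statistic for matched samples at one CpG site.

    Returns (mean difference, t statistic, degrees of freedom), with
    differences computed as sperm minus blood.
    """
    if len(blood) != len(sperm):
        raise ValueError("samples must be matched")
    diffs = [s - b for b, s in zip(blood, sperm)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t_stat, n - 1


# Toy beta values for three matched individuals.
mean_diff, t_stat, df = paired_t([0.1, 0.2, 0.3], [0.2, 0.4, 0.3])
```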

GEO analysis
DNA methylation data for 6,288 samples was downloaded from the Gene Expression Omnibus (GEO) including 281 sperm samples and 5,971 somatic tissue samples from male donors, profiled using the 450K or EPIC arrays. Statistical analyses were performed using the bigmelon package in R and statistical tests were performed using limma [42,49]. In the comparison of DNA methylation between sperm and tissue samples from males on GEO, a linear model was fitted using the lmFit() function from the limma R package [49] [52], which removes ambiguously assigned probes from the enrichment analysis.

Correlation between whole blood and sperm DNA methylation
In order to minimise the effect single outliers would have on the correlation analysis, a subset of 'variable' probes was identified by calculating the DNA methylation difference between the 10th and 90th percentiles across all samples, and selecting sites where this was at least 0.05 in both whole blood and sperm (n = 155,269 sites). This approach is similar to the one described previously by Hannon and colleagues [15]. Correlated CpG sites between sperm and blood were identified by Pearson's correlation test across all variable probes. In order to establish the matching null distribution, samples were permuted 100 times and correlations between DNA methylation in whole blood and sperm were recalculated across all variable sites. The density curve of these simulated correlations was added to the histograms of the empirical correlation coefficients to represent the null distribution (Figure 4). To investigate the clustering of DNA methylation patterns at significantly correlated CpG sites, a two-dimensional outlier test was used by adapting the rosnerTest() function from the EnvStats R package [53] to exclude unimodal distributions. Next, k-means clustering was applied for 2 and 3 clusters as implemented in the function pamk() of the R package cluster [54]. This function determines the best-fitting number of clusters (two or three, corresponding to bi- and trimodal methylation distributions).
We manually checked and, if necessary, reassigned clusters which exhibited low between-cluster to within-cluster variance ratios (ratio < 2).
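The variable-probe filter and the permutation-based null distribution can be sketched for a single CpG site as follows. This is an illustrative Python reimplementation rather than the R code used in the study: the function names are hypothetical, the percentile method ("inclusive", which mirrors R's default quantile type) is an assumption, and the sketch permutes the sperm values at one site, whereas the study permuted samples and recalculated correlations across all variable sites.

```python
import random
import statistics


def decile_range(values):
    """Difference between the 90th and 10th percentiles."""
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return deciles[-1] - deciles[0]


def is_variable(blood, sperm, min_range=0.05):
    """A site is 'variable' if the 10th-90th percentile range is at least
    `min_range` in both tissues."""
    return decile_range(blood) >= min_range and decile_range(sperm) >= min_range


def pearson(x, y):
    """Pearson correlation coefficient (inputs must not be constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permuted_correlations(blood, sperm, n_perm=100, seed=0):
    """Null correlations obtained by shuffling the blood-sperm pairing."""
    rng = random.Random(seed)
    shuffled = list(sperm)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(pearson(blood, shuffled))
    return null


# Toy data: ten matched individuals at one site.
blood = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
sperm = [b / 2 + 0.3 for b in blood]
observed = pearson(blood, sperm)
null = permuted_correlations(blood, sperm)
```

Comparing `observed` against the spread of `null` gives an informal sense of how extreme the empirical correlation is under random pairing of samples.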

Annotation of SNPs and genetic enrichments
To annotate SNPs to their location within probe sequences, we used the Illumina EPIC hg38 manifest and dbSNP database build 151 in the SNPlocs.Hsapiens.dbSNP151.GRCh38 R package. SNPs were mapped to probes using the GenomicRanges R package [51] and the distance to the CpG site of the closest SNP in the probe sequence was calculated for each of the 1,513 probes with significant correlations between sperm and blood. We downloaded the locations of the 9,226 correlated regions of systemic interindividual variation (CORSIV) in DNA methylation recently published by Gunasekara and colleagues [26]. These were overlapped with the locations of CpG sites using the hg38 manifest and the GenomicRanges R package. Finally, we downloaded the list of cis methylation QTLs (mQTLs) in blood reported by McClay and colleagues [28]. These were identified using the 450K array, which meant we had to restrict this annotation to probes present on both the EPIC and 450K arrays.
Enrichments for CORSIVs and mQTLs were calculated by Fisher's exact test against the background of non-correlated variable probes.
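An enrichment test of this kind operates on a 2 x 2 table: correlated versus non-correlated variable probes, cross-classified by whether they overlap the annotation of interest. The Python sketch below implements a two-sided Fisher's exact test from the hypergeometric distribution. It is an illustration only; the function name is hypothetical, the reported odds ratio is the simple sample odds ratio (R's fisher.test reports a conditional estimate instead), and the exhaustive sum is intended for small tables rather than array-scale counts.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows: e.g. correlated vs. background probes.
    Columns: e.g. overlapping vs. not overlapping the annotation.
    Returns (sample odds ratio, two-sided P value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are as likely as, or less likely than, the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)


# Toy table with made-up counts.
odds_ratio, p_value = fisher_exact(3, 1, 1, 3)
```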

Obesity and DNA methylation in blood and sperm
Two models were used to investigate the association between obesity and DNA methylation in sperm and blood. First, DNA methylation was regressed onto obesity status in the combined replication cohort, in blood and sperm separately. This analysis was controlled for estimated blood cell counts in blood.
Secondly, a mixed effects model was run across both the discovery and replication cohorts using the lmer() function from the lme4 package in R [55], regressing DNA methylation onto tissue (blood versus sperm), age, batch and obesity status, while controlling for interindividual variation with a random effect:

lmer(Methylation ~ Tissue + Age + Batch + Obesity + (1 | ID))
Given our small sample size, especially in the obese group, we downloaded summary statistics from an EWAS of BMI in whole blood [1]. Of the replicated array-wide significant probes (P < 1.0 × 10⁻⁷) reported by Wahl and colleagues, 187 were also present in our dataset. To make our data comparable, we treated BMI as a continuous measure for these comparisons, regressing DNA methylation onto BMI rather than obesity status and controlling for estimated blood cell proportions in the blood analysis. First, we tested for an enrichment of lower ranked P values amongst the 187 previously reported probes in our analysis using a Wilcoxon rank sum test. Secondly, we looked at correlations of effect sizes reported by Wahl and colleagues and observed in our data across the 187 probes using Spearman's rank correlation to allow for study-specific biases.
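Spearman's rank correlation compares the ordering of effect sizes rather than their magnitudes, which is why it tolerates study-specific scaling differences. A minimal Python sketch follows, assuming two equally long lists of per-probe effect sizes (one published, one observed). The function names are hypothetical and ties are handled by average ranks.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


# Toy effect sizes: same ordering, different scale.
published = [0.001, 0.004, 0.009, 0.016, 0.025]
observed = [0.01, 0.02, 0.03, 0.04, 0.05]
rho = spearman(published, observed)
```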

Interaction between obesity, tissue and DNA methylation
To detect an interaction between obesity and the association between blood and sperm DNA methylation, we ran a linear model regressing DNA methylation in blood onto DNA methylation in sperm, obesity status and their interaction effect, while covarying for experimental batch and age:

lm(MethylationBlood ~ MethylationSperm * Obesity + Age + Batch)
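To make the structure of this interaction model concrete, the Python sketch below builds the corresponding design matrix (intercept, sperm methylation, obesity, their product, age and batch) and fits it by ordinary least squares via the normal equations. It is a didactic stand-in for R's lm(), not the authors' code: the function names are hypothetical, obesity and batch are assumed to be coded 0/1, and no standard errors or P values are computed.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_interaction(blood, sperm, obese, age, batch):
    """Least-squares fit of blood ~ sperm * obese + age + batch."""
    rows = [[1.0, s, o, s * o, a, b]
            for s, o, a, b in zip(sperm, obese, age, batch)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, blood)) for i in range(p)]
    names = ["intercept", "sperm", "obesity", "sperm:obesity", "age", "batch"]
    return dict(zip(names, solve(xtx, xty)))


# Noise-free toy data generated from known coefficients.
sperm = [0.10, 0.30, 0.50, 0.70, 0.20, 0.40, 0.60, 0.80]
obese = [0, 0, 0, 0, 1, 1, 1, 1]
age = [30, 35, 41, 28, 33, 45, 38, 29]
batch = [0, 1, 0, 1, 1, 0, 0, 1]
blood = [0.2 + 0.5 * s + 0.05 * o + 0.3 * s * o + 0.001 * a + 0.02 * b
         for s, o, a, b in zip(sperm, obese, age, batch)]
coef = fit_interaction(blood, sperm, obese, age, batch)
```

The `sperm:obesity` coefficient is the quantity of interest: it measures how much the blood-sperm slope differs between obese and lean men.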

Cell-type composition
As whole blood represents a heterogeneous tissue where the composition of leukocytes can introduce bias in the interpretation of DNA methylation analysis findings, blood cell type counts of monocytes, granulocytes, NK cells, B cells, CD8+ T cells, and CD4+ T cells were estimated from the DNA methylation data using the method described by Houseman [56]. These estimates were included in all analyses that were run on the blood dataset alone as described above.

Multiple testing correction
For agnostic analyses across the whole EPIC array (including those restricted to variable probes), we used the threshold P < 9 × 10⁻⁸, as reported in recently published statistical guidelines for the EPIC array [57].
For the GEO analysis only the set of probes present on both the 450K and EPIC array were used. We applied Bonferroni correction across these 452,626 sites.
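The Bonferroni-corrected threshold for the GEO analysis follows directly from dividing the family-wise error rate by the number of tests. In the Python sketch below, the function names and probe IDs are hypothetical, and alpha = 0.05 is an assumption, since the text does not state the family-wise error rate explicitly.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_sites(p_values, alpha, n_tests):
    """Keep sites whose P value is below the Bonferroni threshold.

    `p_values` maps a probe ID to its P value.
    """
    threshold = bonferroni_threshold(alpha, n_tests)
    return {probe: p for probe, p in p_values.items() if p < threshold}


# 452,626 probes shared between the 450K and EPIC arrays; alpha is assumed.
threshold = bonferroni_threshold(0.05, 452_626)
hits = significant_sites({"probe_a": 1e-9, "probe_b": 1e-6}, 0.05, 452_626)
```

Under this assumption the threshold is slightly above 1.1 × 10⁻⁷, which is the same order of magnitude as the array-wide threshold used for the EPIC analyses.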

Ethics approval and consent to participate
Ethical approval for the study was granted from the South East Coast -Surrey Research Ethics

Consent for publication
Not applicable.

Availability of data and materials
The datasets supporting the conclusions of this article are available in the Gene Expression Omnibus repository, under GEO accession number GSE149318.

Competing interests
The authors declare that they have no competing interests.

Sites showing > 0.8 median beta value were classified as "high", sites with median beta < 0.2 as "low". Enrichments of each region amongst "high" and "low" methylation sites were calculated against the annotation of intermediately methylated sites (median beta between 0.2 and 0.

Supplementary Table 2. Enrichments of genomic region annotations across sites showing extreme methylation values in sperm.
Sites showing > 80% median DNA methylation were classified as "high", sites with < 20% methylation as "low". Enrichments of each region amongst "high" and "low" methylation sites were calculated against the annotation of intermediately methylated sites (20-80% median DNA methylation) using a Fisher's exact test.

OR = odds ratio
Supplementary Table 3. Summary statistics for differences in DNA methylation between whole blood and sperm.
We used a paired t-test to identify DNA methylation differences between whole blood and sperm across all 704,356 probes passing quality control in the discovery dataset. Summary statistics are reported for all sites in the discovery dataset. Summary statistics from the replication cohort are reported for sites that also passed quality control in our replication dataset. IlmnID = Illumina CpG identifier, chr = chromosome, location = position on chromosome in hg19 reference, P = p-value in the discovery data, effect = effect size in the discovery data, P_rep = p-value in the lean replication cohort, effect_lean = effect size in the lean replication cohort, P_ob = p-value in the obese replication cohort, effect_ob = effect size in the obese replication cohort.

Using a paired t-test, the DNA methylation difference between the median methylation in blood and sperm was calculated for each region. The DNA methylation difference is shown with respect to blood (a positive value indicating higher average DNA methylation in sperm).

Supplementary Table 13. Summary statistics for the association between DNA methylation and obesity in whole blood and sperm.
We regressed DNA methylation onto obesity status in our replication cohort, separately in whole blood and sperm, controlling for estimated blood cell type proportions in the blood analysis. We furthermore used a linear mixed effects (LME) model across the combined discovery and replication datasets, regressing DNA methylation onto obesity status, tissue type and batch while controlling for interindividual variation. Summary statistics for both analyses are reported; the LME results are restricted to sites available in both the discovery and replication datasets. IlmnID = Illumina CpG identifier, chr = chromosome, location = position on chromosome in hg19 reference, P_blood = p-value in blood analysis, effect_blood = effect size in whole blood, P_sperm = p-value in sperm analysis, effect_sperm = effect size in sperm, P_mix = p-value in the mixed effects model, effect_mix = effect size in the mixed effects model. All effect sizes are reported using the lean men as the reference group.

Supplementary Figure 1. DNA methylation age prediction in whole blood and sperm. A) As reported previously, the DNA methylation age predictor by Horvath was able to accurately predict chronological age from DNA methylation in whole blood (r = 0.74, P = 2.55 × 10⁻⁹, Pearson's product moment correlation) but not in sperm (r = 0.26, P = 0.07). B) However, chronological age could be more accurately predicted from DNA methylation in sperm using the predictor more recently developed by Jenkins and colleagues (r = 0.68, P = 1.78 × 10⁻⁷).

DNA methylation annotated to known imprinted genes (Geneimprint database; http://www.geneimprint.com) showed a characteristic enrichment in sites with beta around 0.5 (±0.1) in whole blood, particularly in genes known to be paternally imprinted (P < 1.00 × 10⁻⁵⁰, Fisher's exact test), but also in maternally imprinted genes (P = 9.19 × 10⁻⁹), with a less pronounced enrichment in genes predicted to be imprinted paternally (P = 0.01) or maternally (P = 0.04). No such enrichment was observed in sperm (P > 0.05 for all four tests).

The effect sizes at the 441,764 significant probes from discovery, which were also present in the replication cohorts, were highly correlated with those observed in the replication cohorts (lean cohort: r = 0.98, P < 1.0 × 10⁻⁵⁰; obese cohort: r = 0.99, P < 1.0 × 10⁻⁵⁰).

Shown is DNA methylation in whole blood and sperm from the discovery and replication cohorts at A) cg02474032 (chr16:87678659), B) cg25554892 (chrX:70434406), and C) cg07636088 (chr13:31734946). We observed higher measured DNA methylation in the individual outlier at less than 2% of these 365 sites.

in both cohorts)
A) The majority of these sites (127 sites; 76%) were characterized by a single outlier in the discovery cohort, without any outliers in the replication cohorts. One example is found at cg27045994 (chr16:87678659). B) cg25253080 (chr10:14795564) represents the only instance where a group of 5 outliers did not replicate in either replication cohort. C) The largest outlier group that did not replicate contained 6 individuals, with no outliers in the replication data, and was found at cg27045994 (chr8:284126). D) The only trimodal distribution that did not replicate was observed at cg17118288 (chr1:218563763).

The majority of significant interactions between sperm and blood DNA methylation and obesity were driven by single or very few outliers in the obesity group. A) At cg23132872 (chr2:191882300), the correlation in obese individuals is driven by a single outlier. B) At cg22086461 (chr8:77343728), the correlation in obese individuals is driven by two outliers. C) At cg17166874 (chr7:155381422), the correlation in lean men is driven by four outliers in the discovery cohort, and methylation at this site is also characterized by substantial batch effects. D) At cg19778375 (chr12:297831), there appears to be a batch effect between the discovery and replication cohorts that contributes to an observed correlation in the lean men from the discovery cohort, which is not present in the replication datasets.