Tumor purity adjusted beta values improve biological interpretability of high-dimensional DNA methylation data

A common issue affecting DNA methylation analysis in tumor tissue is the presence of a substantial amount of non-tumor methylation signal derived from the surrounding microenvironment. Although approaches for quantifying and correcting for the infiltration component have been proposed previously, we believe these have not fully addressed the issue in a comprehensive and universally applicable way. We present a multi-population framework for adjusting DNA methylation beta values on the Illumina 450/850K platform using generic purity estimates to account for non-tumor signal. Our approach also provides an indirect estimate of the aggregate methylation state of the surrounding normal tissue. Using whole exome sequencing derived purity estimates and Illumina 450K methylation array data generated by The Cancer Genome Atlas project (TCGA), we provide a demonstration of this framework in breast cancer illustrating the effect of beta correction on the aggregate methylation beta value distribution, clustering accuracy, and global methylation profiles.


Introduction 31
Epigenetic alterations in tumor cells are a hallmark of cancer, and changes in the DNA methylome represent 32 the first, and to date the best characterized, example of a bona-fide epigenetic mechanism for altering gene 33 regulation in cancer. Understanding epigenetic alterations and their effect on tumorigenesis has therefore 34 long been a research focus in cancer and numerous attempts at capturing epigenotypes of cancer have been 35 made over the years [1-6].

36
Carcinogenesis is however a complex process involving both epigenetic and genetic insults to the human 37 genome as well as an intricate interplay between the tumor compartment and the surrounding normal tissue 38 as well as cells of the immune system. Consequently, resected bulk tumor samples are not completely pure 39 with respect to tumor cells, but instead consist of a complex mixture of cell types representative of the 40 tumor and its microenvironment. This mixture of malignant and different non-malignant cell types has 41 profound implications on the readout of, e.g., high dimensional genomic profiling techniques. For instance, 42 for somatic mutations occurring in tumor cells the observed variant allele frequency, i.e. the number of 43 sequence reads with a variant divided by the total number of reads covering a specific locus, becomes a 44 function of the proportions of malignant and non-malignant cells. As different cell types also have different 45 epigenetic states it follows that this admixture of cell types also creates a mixture of epigenetic states in the 46 analyzed tissue, making the discovery and characterization of pure epigenetic subtypes of a given cancer 47 type a challenging task.

48
To circumvent the issue of mixed cell types in bulk tumor specimens and peripheral blood different 49 approaches have been proposed to deconvolve RNA-sequencing [7-9] or global DNA methylation data 50 [10-14] to provide estimates of the magnitude and/or nature of the tumor and non-tumor compartments. 51 These methods have often focused on quantifying the immune component of the tumor compartment and 52 relied on flow-sorted gene expression or DNA methylation data generated from purified blood cell types as 53 the basis for deconvolution.

54
Another category of methods has instead focused on quantifying the purity of tumor samples from 55 DNA/RNA-sequencing or DNA methylation data without trying to delineate cell types within the non-56 tumor compartment. Methods for deriving purity estimates from DNA methylation data have been reported 57 by several groups, e.g. [15][16][17][18]. With respect to the performance of these methods for estimating sample 58 purity, sequencing-based estimates have generally constituted the golden standard measure of comparison 59 in the respective studies.

60
Although most developed and used tools for DNA methylation deconvolution and/or tumor purity analysis 61 have been aimed at quantifying tumor purity or the admixture of infiltrating cell types, few have been aimed 62 at correcting for it globally, but have rather proposed using purity as a covariate in group comparisons 63 and/or clustering [17,[19][20][21][22]. In the case of tools such as those available in the InfinuimPurify-package, 64 which includes a number of approaches directed at controlling for purity in epigenomic analyses, these 65 methods frequently make use of matched reference normal samples for establishing the baseline 66 methylation state of the non-tumor compartment and are currently only implemented on the Illumina 67 Infinuim 450K platform [19]. The requirement of available "normal" baseline data is not always simple to 68 meet and although the Infinium 450K remains the historically most used methylation profiling platform, it 69 has subsequently been replaced by the EPIC 850K array [23].

70
Accounting for confounding signal derived from the tumor microenvironment (TME) is conceptually 71 straightforward and a theoretical perfect separation is obtainable at CpGs in which the tumor and non-tumor 72 compartment respectively show homogenous and diametrically opposed DNA methylation states. Indeed, 73 this assumption forms the basis for most proposed correction and estimation algorithms and logic dictates 74 that it is only at sites where methylation differs between the tumor and non-tumor compartments that 75 deconvolution strategies can separate alleles contributed by the respective components of the aggregate 76 sample. For the purpose of estimating overall tumor purity this assumption typically only needs to be met 77 for a small minority of the hundreds of thousands of assayed CpG-sites in order for robust estimates to be 78 made. But in order to correct for the TME influence on tumor methylation estimates, one needs to account 79 for the fact that only a minority of tumor cells alter their methylation states in relation to the normal tissue 80 background. This invalidates a simple assumption of one tumor-and one normal compartment with 81 diametrically opposed methylation states, which, e.g., the InfiniumPurify-function implicitly makes [19]. beta values between the two utilized probe chemistries was adjusted for using the method of Holm et al. 137 [4]. As a normal reference data set, we used GSE67919 [39] downloaded from the Gene Expression 138 Omnibus (GEO) and corrected for the Infinium I/II effect in analyses related to our correction method. For 139 use with the "InfiniumPurify" method (below) we did not perform Infinium I/II correction on normal 140 samples based on the observation that this step influenced the performance of the method negatively (data 141 not shown).

142
143 A strategy to model and correct CpG methylation with respect to tumor purity

144
Based on the observation that DNA methylation and tumor purity often interact to produce mixed 145 methylation states that can be modelled by one or more linear functions we devised an algorithm in the R 146 programming environment [40] to correct for this effect that produces purified tumor-as well as inferred 147 "normal" methylation profiles for a cohort of samples to be corrected. The first objective of the algorithm 148 is to on a CpG basis identify the presence of up to three natural populations of samples that arise as a 149 product of the admixture of tumor and non-tumor cells, and then use the discovered populations in a second variables and with parameters K=1-3 and nrep=3. A best fit model is chosen using the function "getModel" 157 with "BIC" as criterion and the clusters (N=1-3) are obtained using the function "clusters". Next, a linear 158 model is fitted to each identified population using the original methylation estimate as the dependent and 159 1-purity as the independent variable to obtain a linear fit with the intercept serving as the pure tumor 160 methylation state. To obtain the normal background methylation state for each population the same 161 regression is performed with purity as the independent variable. The final methylation state is in each case 162 formed by adding the residuals of each fit to the obtained intercept(s). In a final step, values <0 and >1 are 163 set to zero and one, respectively. The individually adjusted CpG methylation estimates for each sample are 164 then aggregated to reform a data matrix of the same dimensions as the input beta matrix (CpGs as rows, 165 samples as columns). For improved run speed the function is implemented using the R-package "parallel" 166 so that calculations for separate CpGs are distributed across a user-specified number of available cores. To 167 produce deterministic results, each individual CpG is also paired with a unique RNG seed number that 168 makes each parallelized operation fully reproducible. Scripts required to perform beta adjustment are 169 available on github under the following link (https://github.com/StaafLab/adjustBetas).

172
The InfiniumPurify R package [19] contains functions for both the estimation of tumor purity (function 173 getPurity) and correction of beta values (function InfiniumPurify). The correction function requires a tumor 174 set as well as a normal data set (both N>20) in combination with a purity estimate for each tumor sample. 175 For all runs we used the 630 TCGA BRCA samples processed as described above. Normal methylation 176 profiles were obtained from GSE67919 and were not corrected for Infinium I/II effects when used with 177 InfiniumPurify. The DNA sequencing-based purity variable was obtained from [27].

181
Modeling DNA methylation as a function of tumor purity 182 We previously analyzed genome-wide DNA methylation patterns in the context of BRCA1-mutated vs 183 hypermethylated triple negative breast cancer (TNBC) [24]. As part of this study, we noted anecdotally that 184 CpG loci display different methylation characteristics with respect to tumor purity. An example of one of 185 these patterns is illustrated by the methylation state of a CpG in the BRCA1-promoter region in relation to 186 tumor fraction estimates derived from WGS ( Figure 1a) the mixture modeling algorithm being overly sensitive to minor differences in the data variance structure.

194
This effect can be mitigated by, e.g., adding a small component of normally distributed noise to the 195 methylation estimate, yielding two rather than three populations given the same input variables ( Figure 1c).

196
By using the intercept terms of the respective linear regression models for the two populations, as well as 197 purity and 1-purity as the independent variable, an estimate of the methylation state of both the "normal  subset of tumors diverges from the presumed somatic methylation state, we set out to define a framework 222 for correcting DNA methylation beta values that allows for the presence of multiple separate methylation 223 states in the tumor compartment, as well as estimation of the aggregate methylation state of the non-tumor 224 compartment without the requirement of a-priori information about the normal tissue methylation state or 225 paired normal data. For this we settled on an unsupervised approach using mixture modeling, in which the

226
FlexMix framework is applied on the level of individual CpGs for automated population discovery of 227 samples in a cohort with similar underlying methylation states confounded by non-tumor methylation using 228 a "purity" estimate as a regression covariate.

229
Our framework is agnostic to how the tumor purity estimate for a sample is obtained. Thus, estimates may 230 be obtained from analysis of genetic data by WES, WGS or SNP microarrays, epigenetic data (e.g., DNA 231 methylation arrays), or pathology estimates. The framework does however intuitively require a subset of 232 cases in a sample cohort to diverge from the presumed somatic methylation state for the population 233 identification to be robust. Based on empirical assessment, and to limit runtimes, we chose a maximum of 234 three populations to model in each iteration (CpG unadjusted beta values we found increased discrimination of the Basal vs Luminal split in terms of 252 specificity and accuracy using our approach, while beta dichotomization did not improve the discriminatory 253 power neither relative to unadjusted data nor in absolute terms (Figures 2b-c). Using InfiniumPurify yielded 254 better results than dichotomization in terms of most assayed metrics but did not outperform our proposed 255 method.

273
We also evaluated the aggregate beta distribution in uncorrected and corrected data respectively and found 274 that adjusted beta values were shifted towards the extremes of the beta distribution in comparison with 275 unadjusted beta values consistent with the conception of DNA methylation being binary (Figure 3b). To 276 evaluate the performance of our method for inferring the normal methylation state from analysis of bulk 277 tumor specimens we compared the inferred normal state data from the 630 TCGA breast cancer samples 278 against a public data set of 81 healthy breast tissue samples profiled on the same Illumina methylation 279 platform (GSE67919) [39]. Plotting the mean inferred beta values in our cohort against mean observed beta 280 values in the normal breast samples for all CpGs (5000 most varying) showed a good overall agreement 281 (Pearson's r=0.93, p<0.001, Figure 3c). We also calculated the correlation coefficients between each 282 individual inferred normal sample and the average methylation state of the same normal breast tissue CpGs 283 and contrasted these with correlations calculated from the corresponding tumor estimates before and after 284 purity adjustment (Supplementary Figure 1). This analysis showed a similar performance on the level of values of 0, 0.5, and 1 respectively. Given that our method produces valid estimates of the normal sample 318 methylome, we expect these estimates to mirror those seen in bona fide normal samples and adhere to the 319 beta distribution expected from biological theory. We therefore extracted CpGs mapping to X-chromosome 320 promoters (N=2921) from both the inferred normal methylomes and GSE67919. For both data sources we 321 plotted the median beta distribution and calculated the fractions falling into the three theoretical X-322 methylation categories (hypo, Xi, and hyper, Figure 4a). With respect to the three methylation bins the 323 concordance was 90.9%, and with a majority of X promoter CpGs being in a hemimethylated (Xi) state. A 324 minority of CpGs in both data sources were hypomethylated and a subset of these are expected to represent 325 promoters of genes that escape Xi. As a further test of biological coherence, we tested whether the 326 hypomethylated promoters resided in promoters of genes that escape Xi. For this we cross-referenced the 327 gene symbols annotations of the CpGs with a list of 75 genes that were found to escape Xi by Katisr and 328 Linal [43]. As expected, the hypomethylated bin was enriched for localization in promoters of genes 329 escaping Xi (4.6% of hypo CpGs vs 0.8% and 0.6% for Xi and hyper CpGs, respectively). Similarly, 330 plotting the beta distribution of all CpGs mapping to high CpG density positions (N=6736) in adjusted and 331 unadjusted tumor samples showed that purity-adjusted beta values adhere better to the theoretical 332 expectation (Figure 4b). promoters were more frequently found to overlap previously published genes that undergo X-escape. B) 339 Purity-adjustment of tumor sample methylation beta values result in a distribution that more closely adheres 340 to the distribution expected from the process of Lyonization.

342
The effects of purity adjustment on between-sample contrast at biologically significant classes of CpGs

343
To establish further biological validity for our beta correction approach, we analyzed pre-and post-344 adjustment beta values for the top 5000 most varying CpGs from the perspective of native genomic context, 345 and with respect to the 5-group hierarchical clusters, PAM50 subtypes, and TNBC vs non-TNBC cancers. methylation state at PRC-marked loci as normal breast tissue (Figures 3a and 5a) Figure 3). A small number of sites (N=52) among the top 361 5000 CpGs were annotated as having overlaps with all three features. These sites mapped to distal (>5kb 362 from nearest promoter) and ocean designated parts of the genome, consistent with these sites acting as distal 363 enhancers. This class of CpGs were densely methylated in Basal/TNBC tumors as well as inferred normal 364 breast tissue (Figure 5a and 3a respectively) and showed near-universal demethylation in non-Basal tumors. 365 Again, the contrast between pre-and post-correction betas showed considerably improved separation of, 366 e.g., Basal vs Luminal tumors, and highlights that non-tumor tissue again acts to lessen this contrast in 367 unadjusted data.

368
Finally, we chose to focus on distal CpGs without overlapping ENCODE transcription factor binding sites 369 (TFBS) (N=476) as representative of non-regulatory DNA methylation (Figure 5d). The CpGs were situated 370 in a low CpG density context and did not overlap TCGA ATAC peaks. The methylation state of these CpGs 371 showed a lower association with PAM50 subtypes or a TNBC designation, but instead had more of a 372 gradient-like methylation profile ranging from very low aggregate levels in cluster 4 (luminal A/B) tumors 373 to generally high in cluster 3 (luminal A/B) and 5 (Basal/TNBC) tumors. Generally, demethylation of low-374 density CpGs has been attributed to lack-of-maintenance or proliferation-related processes, however this 375 cannot likely be the cause in this instance as both HER2-enriched and Basal-like tumors displayed relatively 376 higher methylation levels. In general, Luminal B tumors tended towards lowest overall methylation levels 377 at "non-functional" CpGs, although this was in no way defining for the subtype overall. CpG level, ii) sample-level beta correlation, and iii) correlation of individual inferred normal samples to 399 mean methylation of GSE67919 normal samples. As expected, increasing amounts of noise in purity 400 estimates negatively affect all our calculated metrics proportionally to the magnitude of added noise ( Figure  401 6a). As a rule, the inferred normal estimates had the fastest decay rate with increasing noise, although this 402 effect was not as pronounced on the level of the cohort average as for individual estimates. Similarly, but 403 to a lower extent, FlexMix group calls on the level of individual CpGs were also affected by noise but 404 overall concordance with calls obtained using unperturbed purity estimates remained consistently high 405 (lowest median concordance ~0.9). In terms of sample-level adjusted beta values, a decrease in median 406 correlation and increase in correlation variability was observed but with above 0.95 median correlation in 407 the highest perturbation comparison. Overall, we found that our approach for beta adjustment appeared 408 robust to perturbations of at least moderate magnitudes (mean absolute purity shift in highest noise test 409~0.15). Additionally, we tested purity adjustment in the 235 TNBC samples with available WGS purity 410 estimates, Infinium 850K methylation data as well as gold standard pyrosequencing calls for BRCA1 411 promoter methylation [24]. The most varying 850K probe in the BRCA1 promoter showed excellent overall 412 agreement with pyrosequencing-based BRCA1 hypermethylation calls and applying the FlexMix-based 413 adjustment framework to this single CpG produced highly concordant results (Figure 6b). Analogously to 414 the first perturbation test, we evaluated the influence of increasing the noise level in the purity estimate, 415 this time to the BRCA1 promoter CpG only. We performed 50 iterations each of beta adjustment across 26 416 increasingly confounded purity estimates (noise N(0,s), s=0.02 to 0.5 in increments of 0.02). Somewhat 417 surprisingly, overall accuracy remained stable across the tested span albeit with a tendency towards 418 deterioration at higher noise levels ( Figure 6c). Interestingly, adding higher levels of pure noise to the purity 419 estimate results in a lower overall magnitude of adjustment, i.e., beta estimates are shifted less on average 420 the more deteriorated the purity estimate is (Figure 6d). With respect to the high maintained accuracy across 421 noise levels, this is a byproduct of only purity estimates being confounded by noise while keeping 422 methylation levels unchanged. The FlexMix framework is therefore still able to reliably separate tumors 423 with differing methylation states, but with increasingly poor performance in finding the correct intercept 424 for the tumor and normal populations as indicated in Figure 6a. with molecular subgroups of a malignancy. In most epigenetic analyses of bulk tumor tissue to date, the 445 effect of tumor purity on methylation estimates has not been fully addressed despite that it has long been 446 recognized that methylation estimates can be affected by differing epigenetic states present in the mixture 447 of malignant and non-malignant cell types that make up the bulk tumor. Here, we present a simple strategy 448 to adjust for this at an individual CpG level in high-dimensional DNA methylation data.

449
The concept of adjusting tumor methylation estimates using tumor purity estimates is not new and has been 450 proposed Taken together, we therefore believe there is still an unmet need to further develop flexible and generic 464 algorithms that can address the issue of correcting tumor methylation estimates. To this end, we utilized 465 automated population discovery using the FlexMix framework and linear regression-based adjustment of 466 CpG methylation beta values. We show that our approach yields intuitively interpretable and biologically 467 sound results when applied to a large cohort of breast cancer tumors collected by The Cancer Genome Atlas 468 project. The main advantage of our method is an intuitive and simple approach, a deterministic output, and 469 that the method is applicable cross-platform. We show that using a mixture modelling approach seems to 470 outperform at least simple dichotomization as well as performing regression-based adjustment modelling 471 of the tumor compartment as a single entity. Additionally, our approach can infer the "normal" background 472 methylome with reasonable accuracy, eliminating the need for reference methylomes when a sufficiently 473 large cohort is profiled. We believe that the general approach outlined in this work can be applied to most 474 types of methylation data, potentially with purity variables obtained from many different sources, including 475 the methylation array itself using tools included in packages such as InfiniumPurify [21] or MethylResolver 476 [13]. This general framework can be improved in future work by e.g., refinement of the population 477 discovery process and modelling of systematic bias affecting regression intercept terms and residuals. 478 Moreover, considering the continuously growing body of public methylation profiles for different 479 malignancies the concept of static reference sets usable for correcting small sized cohorts could be 480 investigated. As currently devised, the method runs the full 450K array (630 samples x ~ 420 000 CpGs) 481 in around 8 hours on a standard 4-core processor, and runtime scales linearly downward with a larger 482 core/thread count. Notably, this work is not intended to produce an optimal method or perform extensive 483 benchmarking against other methods, neither is it aimed at evaluating the feasibility of using different data 484 sources for deriving input purity estimates. Instead, this work serves to illustrate a way of accomplishing 485 the goal of reducing the impact of the non-tumor compartment on methylation estimates derived from the 486 Illumina 450/850K arrays.

487
Our search for a method capable of adjusting methylation estimates to account for tumor purity started with 488 our observations in the BRCA1 gene locus [24]. Through this work it has become evident that the effect of 489 non-tumor methylation is pervasive in bulk tumor data, and that thousands of loci are affected in any given 490 high-throughput methylation experiment. The RB1 gene represents another bona-fide tumor suppressor 491 gene that has long been known to become inactivated through DNA methylation in cancer [25]. High impact 492 genes such as RB1 and BRCA1 are epigenetically inactivated at high frequencies which is why they were 493 detected in the early days of molecular cancer research. However, epigenetic silencing of key tumor 494 suppressor genes may be infrequent but critical when appearing. An illustrative example is homologous 495 recombination deficiency (HRD) in breast cancer. While promoter hypermethylation of BRCA1 has been 496 reported as the most frequent cause of HRD in TNBC, a substantial number of patients' tumors still lack a 497 known inactivation mechanism (driver alteration) [26]. In these tumors the HRD phenotype is likely 498 conferred by alterations in less penetrant genes or through polygenic interactions, as illustrated by the 499 infrequent (2%) promoter hypermethylation of RAD51C in TNBC that however confers a genetic HRD 500 phenotype similar to BRCA2 inactivation [26]. Improved processing of DNA methylation data can thus be 501 a critical component in the search for novel but infrequently deactivated tumor suppressor genes. Another 502 important issue made more addressable by our method is the question of which and how many pure 503 epigenetic phenotypes exist in e.g., breast cancer, and what are the base epigenotypes on top of which the 504 intrinsic subtypes reside. Epigenetic profiling of pure (or purified) methylomes may therefore provide novel Conclusions 511 We present a conceptual method and an algorithm for correcting large-scale DNA methylation data for the 512 influence of the TME on global methylation estimates. Our method uses a flexible and simple mixture 513 modeling approach to identify tumor sample populations differentially influenced by non-tumor 514 background and correct for this effect on a global basis. Our method also generates accurate estimates of 515 the ground-state methylation of the non-tumor compartment and does not rely on available normal samples. 516 We expect that our approach can be developed further and integrated into pre-existing methods to improve 517 the capacity of these for de-noising bulk methylation data and allow for more unbiased epigenomic 518 analyses. Additionally, we believe that in-depth analysis of loci that exhibit a BRCA1-like methylation 519 pattern could yield novel leads in the search for tumor suppressor genes frequently inactivated by DNA 520 methylation and that purified methylomes may provide valuable insights into the question of pure 521 epigenotypes in cancer.

524
The authors would like to thank Dr Dominik Glodzik at Harvard Medical School for suggesting the use of 525 FlexMix as the basis for the proposed approach. We would also like to thank Dr Shamik Mitra at Lund 526 University Department of Clinical Genetics for providing valuable input to the manuscript. We 527 acknowledge the work of The Cancer Genome Atlas (TCGA) project and the National Cancer Institute 528 Genomic Data Commons (NCI GDC) in making the underlying data available. Agree to be accountable for all aspects of the work: All authors