A systematic evaluation of 41 DNA methylation predictors across 101 data preprocessing and normalization strategies highlights considerable variation in algorithm performance

Background DNA methylation (DNAm) based predictors hold great promise to serve as clinical tools for health interventions and disease management. While these algorithms often have high prediction accuracy and are associated with many disease-related phenotypes, the reliability of their performance remains to be determined. We therefore conducted a systematic evaluation across 101 different data processing strategies that preprocess and normalize DNAm data and assessed how each analytical strategy affects the reliability and prediction accuracy of 41 DNAm-based predictors. Results Our analyses were conducted in a large EPIC DNAm sample of the Jackson Heart Study (N=2,053) that included 146 pairs of technical replicate samples. By estimating the average absolute agreement between replicate pairs, we show that 32 out of 41 predictors (78%) demonstrate excellent test-retest reliability when appropriate data processing and normalization steps are implemented. Across all pairs of predictors, we find a moderate correlation in performance across analytical strategies (mean rho=0.40, SD=0.27), highlighting significant heterogeneity in performance across algorithms within a choice of an analytical pipeline. (Un)successful removal of technical variation furthermore significantly impacts downstream phenotypic association analysis, such as all-cause mortality risk associations. Conclusions We show that DNAm-based algorithms are sensitive to technical variation. The right choice of data processing and normalization pipeline is important to achieve reproducible estimates and improve prediction accuracy in downstream phenotypic association analyses. For each of the 41 DNAm predictors, we report its test-retest reliability and provide the best performing analytical strategy as a guideline for the research community. As DNAm-based predictors become more and more widely used, both for research purposes as well as for clinic applications, our work helps improve their performance and standardize their implementation.


Results
To evaluate how unwanted technical variation in DNAm data impacts the performance of DNAm-based predictors, we implemented 101 data processing and normalization strategies in the JHS dataset. For each analytical strategy, which we will refer to as a "pipeline", we then extracted beta values and calculated estimates of 41 DNAm-based predictors in (1) JHS data1: a sample of 146 technical replicate pairs and (2) JHS data 2: a general sample of 1,761 nonreplicate samples that do not overlap with the individuals in the replicate dataset. Figure 1 shows an overview of our analysis plan. In the sample of technical replicates, we quantified the average absolute agreement between replicate pair values (i.e. reliability) by means of the ICC for each DNAm predictor and each pipeline separately (41 predictors x 101 pipelines = 4,141 ICC analysis). We also generated DNAm estimates in the general sample. This allowed us to correlate the ICC of a pipeline that was estimated in the sample of replicates with predictor estimates in the independent general JHS sample. DNAm analyses are conducted using DNAm EPIC array samples in JHS. JHS includes a significant number of technical replicate pairs thereby allowing for a careful investigation of how the removal of unwanted technical variation impacts DNAm algorithm performance across 101 data processing pipelines. JHS has also collected information on disease-related phenotypes, including mortality status after follow-up. This allowed us to assess how removal of technical variation in DNAm predictor estimates by a data processing pipeline impacts downstream phenotypic association analyses.

There is significant heterogeneity in pipeline performance across predictors
Among the 41 best performing pipelines (i.e. the pipeline with the largest ICC value for each of the 41 predictors), there are 27 different data processing and normalization strategies, which highlights significant heterogeneity in choice of best pipeline between predictors. As ICC differences between pipelines of a predictor can be small and pipelines beyond the highest ICC may also be informative, we calculated the median rank across the 41 predictors for each of the 101 pipelines (see Table S6). The pipeline with the best median rank (at 15) across predictors is the "ENmix: oob_relic_q1_rcp". While this observation suggests this pipeline yields the best average performance across predictors, it still scored average to low for multiple predictors. For example, for the BMI predictor the "ENmix: oob_relic_q1_rcp" pipeline had one of the lowest ranks (ICC = 0.89, rank = 91). It is also important to note that a data processing pipeline can also introduce more spurious variation instead of removing technical variation. That is, the raw data pipeline that does not apply any data processing and normalization yielded a median rank of 85 (range: 7 to 100). For the CD4T and CD8 naive predictors, the raw data pipeline ranked as the seventh best performing pipeline highlighting that most pipelines perform worse than no data processing at all for these two predictors. The "Minfi: raw_quantile_strat" and "Minfi: illumina_bg_quantile_strat" had the lowest median rank of 100 and yielded the lowest reliability for 17 and 9 predictors, respectively (Table S5).
To assess the concordance in pipeline performance across predictors more formally, we calculated the rank correlation in pipeline reliability between all pairs of predictors. In Figure 3 we visualize the result of this analysis via a clustered correlation heatmap. Shown is a clustered correlation heatmap of pipeline reliability concordance between predictors. The color coding depicts Spearman's rho and clustering is performed using hierarchical clustering. Only correlations with a P-value < 0.01 are colored.
For some predictors the ranking in pipeline performance is very similar. For example, the GrimAge, Smoking_Lu, Cystatin_C, and GDF_15 predictors show strong concordance (mean rho = 0.92). As noted, these four predictors were developed in the same dataset and the Cystatin_C, GDF_15, and Smoking_Lu estimates are included in the GrimAge algorithm. Across all pairs of predictors, we find a moderate correlation in pipeline performance (mean rho=0.40, SD=0.27). Some predictors however show little to no concordance with other predictors. The ranking of pipelines of the BMI and NK predictor, for example, have a mean rank correlation of 0.14 (SD=0.20) and 0.21 (SD=0.24), respectively, with that of other predictors. For a handful of predictor-pairs we even observe a negative correlation, suggesting that pipelines that yield high reliability for one predictor yield low reliability for another. Pipeline performance of the BioAge4HAStatic and Mono predictors for example have a correlation of -0.45 (P=2.1e-06). Our findings thus far show that specific pipelines are more effective in removing unwanted technical variation for a predictor and that significant heterogeneity exists in pipeline performance across predictors.

The choice of data processing pipeline impacts downstream analysis of predictors
Next, we evaluated if the performance of a pipeline can also affect downstream phenotypic analyses of a predictor. For these analyses, we used the general JHS data 2 sample. For each pipeline, we calculated the mean and standard deviation (SD) of the predictor estimate distribution in the general JHS sample. For each predictor, we then correlated these two statistics (i.e., the mean and SD) with the ICC estimates of the pipelines obtained in the technical replicate sample. We find that the choice of pipeline has a significant impact on the distribution of the predictor estimate. Of the 41 predictors, 33 (80%) are significantly impacted on the distribution of their estimates after Bonferroni correction P<0.0012). For 22 predictors (54%), we find a significant correlation for both the mean and standard deviation. For DNAmTL, we, for example, observe a negative correlation between the performance of a pipeline and the mean of the estimate distribution (rho=-0.71, P<2.2e-16) and a positive correlation with the standard deviation of the estimate distribution (rho=0.79, P<2.2e-16). The best performing pipeline yields a mean estimate of 6.83 kilobases (SD=0.34). The least performing pipeline yields a mean estimate of 7.20 kilobases (SD=0.29). This shows that the more effective a pipeline is in removing technical variation, the lower the DNAm-based predicted estimate of telomere length and the larger the variation between individuals. The direction of effect of the relationship between pipeline performance and the mean and standard deviation of the DNAm variables varies between predictors as well. HorvathAge, for example, is impacted on its standard deviation (rho=0.39, P=5.6e-05) but not on the mean (rho=-0.10, P=0.27). HDLratio is impacted on its mean but unlike DNAmTL shows a positive correlation with pipeline performance (rho=0.38, P=9.8e-05). HDLratio is not impacted on the standard deviation of its distribution (rho=0.00, P=0.96). Correlation plots and correlation statistics of all predictors are shown in Supplementary Note 1. A full overview of test statistics can be found in Table S7.
Several DNAm age predictors are known to predict all-cause mortality risk. We therefore examined if pipeline performance also impacts their association with mortality risk. We focus on four predictors: HorvathAge, PhenoAge, GrimAge, and ZhangAge. Each predictor has different training characteristics and captures a different aspect of biological age and/or mortality risk [41]. ZhangAge is a blood-based DNAm clock and was developed on the largest training dataset and shown not to be associated with mortality risk despite its improved precision [8]. We find that pipeline performance significantly impacts downstream analysis for all four predictors ( Figure 4). For HorvathAge, pipelines that achieve greater reliability also achieve a greater correlation between HorvathAge and chronological age (rho=0.47, P=1.8e06). Better performing pipelines furthermore achieve greater power to predict all-cause mortality (rho=0.52, P=3.3e-08). For PhenoAge, we did not find an effect on the correlation with chronological age but did find the survival analysis to be significantly impacted. Better performing pipelines achieve greater power for PhenoAge (rho=0.68, P<2.2e-16) but also a smaller hazard ratio (rho=-0.39, P=5.3e-05), suggesting that unsuccessful removal of technical variation in DNAm data can inflate the magnitude of mortality risk. In contrast to our findings for HorvathAge, we found that better performing pipelines produced a lower correlation with chronological age for GrimAge (rho=-0.67, P < 2.2e-16). Similar to PhenoAge, we found that pipelines that achieve greater reliability yield more significant associations with mortality for GrimAge (rho=0.75, P<2.2e-16) but also a smaller hazard ratio (rho=-0.62, P<2.2e-16). The most reliable pipeline reports a significant hazard ratio of 1.12 (SE=0.01, P=1.60e-30), which verifies GrimAge as a strong predictor of all-cause mortality, especially when spurious technical variation is appropriately accounted for. For ZhangAge, we found no impact on the correlation with chronological age. Better performing pipelines produced smaller and less significant effects in associations with all-cause mortality. The most reliable pipeline produced a non-significant hazard ratio of 1.10 (SE=0.05, P=0.06), confirming that ZhangAge does not predict mortality risk. Taken together, using the general JHS sample, we demonstrate how pipeline performance has a significant impact on downstream phenotypic analysis of DNAm predictors.

Predictor reliability is inversely associated with sample size of the training dataset
To assess if specific features of the predictors are associated with higher reliability, we investigated the number of CpG probes and the sample size of the training dataset in relation to the ICC of the best performing pipeline (see Figure S3). Using predictors for which such information was available, we find that the sample size of the dataset in which a predictor was developed is inversely associated with the observed predictor reliability (N=37, rho=-0.39, P=0.02). We did not find a significant association between the number of predictor CpG probes and reliability of a predictor (N=37, rho=-0.21, P=0.20).

A smaller number of replicate pairs can be used to measure reliability
In our analyses, we made use of a large number of replicate pairs. We therefore assessed how sample size affected our measure of reliability and if a smaller number of replicate pairs yield similar findings. Across reliabilities from all pipelines and predictors, we observe good concordance (rho > 0.94) with as low as ten replicate pairs compared with measures obtained from larger sample sizes ( Figure S4). Differences however exist between predictors with some predictors still requiring a larger number of replicate pairs (Supplementary Note 2).

Discussion
DNAm-based predictors are emerging as powerful new methods to study health and disease, but little is known about the reliability of the estimates they produce. To investigate their performance, we carried out a systematic evaluation of 41 predictors across 101 data processing and normalization strategies and assessed to what degree algorithm performance is impacted by (un)successful removal of technical variation. Leveraging a large technical replicate sample in the JHS, we demonstrate that the choice of analytical pipeline has a significant impact on the reliability of predictors as well as on the outcomes of downstream phenotypic analyses. We highlight that specific pipelines are more effective in removing unwanted technical variation for a predictor but that significant heterogeneity exists in pipeline performance across predictors. Pipelines of the ENmix package achieved the highest reliability and were most frequently represented among the best performing pipelines. As research on DNAm-based predictors will continue to grow, our work provides best practices for the research community to help standardize their implementation and improve their performance.
To quantify method performance, we used a type of intraclass correlation that measures test-retest reliability by assessing the degree of absolute similarity between technical replicate pairs. Guidelines from reliability research suggest that ICC values less than 0.5 are indicative of poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.9 indicate good reliability, and values greater than 0.90 indicate excellent reliability [40]. The ICC range of best performing pipelines across predictors was 0.83-0.99, indicating good to excellent reliability for these predictors. For 32 out of 41 predictors (78%), we found excellent reliability (ICC > 0.9) for at least one data processing pipeline. Several predictors show a reliability close to 1, which demonstrates that repeated collections of DNAm data yield almost the same predictor estimate and highlights their potential as a biomarker for health-related outcomes. Among predictors with high reliability are predictors of mortality risk, smoking behavior, blood cell types, and cancer risk. Demonstrating internal validity for these DNAm tools is important for research purposes but even more so for their potential utilization for health management and disease prediction in the clinic. GrimAge, a strong predictor of all-cause mortality, for example, has the highest test-retest reliability of 0.994. This finding demonstrates excellent test-retest reliability based on technical replicates from the same biological sample. It remains an open question if the measured reliability translates to repeated measures of DNA samples extracted from different blood draws at the same time point or across time points. The analytical framework we applied can however be easily extended to study design of other types of (biological) replicates. Establishing method reliability in other contexts of technical and biological variation is an important next step for future research.
We found that the choice of analytical pipeline is essential as multiple data processing strategies produced poor reliability (ICC<0.5) for several predictors. For some predictors, like for CD4T and CD8 naive T cells, using the raw data achieves higher reliability than most data processing pipelines. This highlights that analytical decisions on how to best prepare DNAm data require careful consideration as certain data processing and normalization steps can even reduce algorithm performance. Among the best performing pipelines of each predictor, we found significant heterogeneity across predictors. That is, there are 27 unique pipelines across the 41 predictors. On average, pipelines of the Enmix package achieved the highest reliability most frequently. While there is no one optimal pipeline to use for all predictors, several data processing steps stand out as producing high reliability for multiple predictors. For example, almost half of the best performing pipelines make use of the RELIC dye-bias correction method. RELIC uses the information between pairs of internal normalization control probes to correct for differences between color channels that measure intensity levels of the array [28]. The EPIC array contains 85 pairs of controls that target the same DNA region in housekeeping genes and contain no underlying CpG sites. RELIC uses the relationship between the pairs of controls to correct for dye-bias on intensity values for the whole array. Another data processing step that produced high reliability is the RCP probe type-bias correction method. 31 out of 41 of the best performing pipelines make use of this data processing step. RCP uses the existing correlation between pairs of nearby type I and II probes to adjust the beta values of all type II probes [27]. Both RELIC and RCP have been shown to reduce technical variation in DNAm data and are implemented in the ENmix package. While both approaches are effective in removing unwanted technical variation, we still recommend using the best performing pipeline for a specific predictor as reported in Table  1 as RELIC and RCP both show heterogeneity in performance across predictors.
The choice of analytical pipeline does not only impact the test-retest reliability of a predictor but also significantly affects downstream phenotypic analyses. We show that 80% of predictors are impacted on the mean and/or standard deviation of their distribution in the general JHS cohort. We furthermore analyzed DNAm clocks and showed that the strength of correlation between DNAm age and chronological age is affected in opposite directions for HorvathAge and GrimAge. While the correlation with chronological age becomes stronger with better performing pipelines for HorvathAge, the correlation becomes weaker for GrimAge. For DNAm clocks that are shown to be associated with mortality risk, successful removal of technical variation produced smaller hazard ratios but more significant associations. This highlights that not appropriately accounting for technical variation can decrease statistical power and inflate risk estimates for these predictors. It also shows that despite the narrow distribution of reliability estimates for these predictors, for example GrimAge has an ICC range of 0.921-0.994 indicating excellent reliability across all pipelines, the choice of pipeline still impacts downstream association analyses. We note that in our association analysis with mortality risk, we adjusted for chronological age, and still found that the choice of pipeline influences the outcome of the analysis. This is different from findings of a previous study that reported that the choice of pipeline influences the mean of DNAm age but not the DNAm age acceleration residual [42]. This study however only compared three data processing and normalization strategies and could have missed this effect as it did not perform a systematic evaluation across many pipelines. Finally, we confirm that ZhangAge, a DNAm clock developed in the largest blood based DNAm dataset, does not associate with mortality risk.
We also investigated if specific characteristics of a predictor impacted the measured reliability. We found that the sample size of the training dataset has a moderate inverse relationship with the reliability of a predictor. This suggests that predictors developed in larger training datasets are more sensitive to technical variation than predictors developed in a smaller dataset. This relationship could for example arise if larger training datasets on average have more technical factors that are not properly accounted for. The ZhangAge predictor, however, was developed in the largest training dataset and shows the second to highest reliability of all predictors we investigated. This indicates that other factors in addition to sample size of the training dataset are likely to play a role as well. ZhangAge was developed using 65 training sets across 14 cohorts, where each training set had a certain number (ranging between 1 and 13) of cohorts randomly sampled from the 14 cohorts [8]. This strategy is, as far as we know, unique to this predictor and may have helped select for CpG probes that are less impacted by technical variation due to its many training sets of different randomly assigned cohort compositions. As training datasets with large sample sizes are essential to developing more accurate DNA-based predictors, a strategy to randomize the potential effect of technical factors, like was implemented for the development of ZhangAge, could be worthwhile to consider for new predictors as well. We did not find a significant relationship between the number of CpG probes and the observed reliability of a predictor.
Our study comes with limitations. First, we measured reliability using technical replicate in one study. A different cohort or different types of repeated measures may yield different outcomes. Ideally, one would use study-specific replicate samples and assess if similar best practices are achieved or if alternative strategies are more appropriate to remove technical variation most optimally for that specific study. If future studies have the means to include replicate samples, they should aim to include at least ten replicate pairs. We determined that for most predictors a sample size of ten replicate pairs can already provide meaningful insights into their reliability. Second, several predictors were not fully compatible with the EPIC array platform. Predictors that were developed on older DNAm array platforms showed lower reliability. Missing probes could have affected the outcome of our analysis. Having said that, as the older 27K and 450K DNAm array platforms are discontinued, any future application of predictors that are not fully compatible with the EPIC array will face a similar challenge.

Conclusion
In summary, this study demonstrates that considerable variation exists in the performance of DNAm-based predictors depending on the data processing and normalization strategy implemented. Analytical pipelines that best remove unwanted technical variation in DNAm data achieve excellent test-retest reliability for most predictors thereby demonstrating their potential as biomarkers for health-related outcomes. DNAm is an important tool to study health and disease. As the number of DNAm predictors continues to rise, understanding how best to improve and implement these algorithms will be essential for downstream clinical applications.

Cohort descriptions
The Jackson Heart Study is a large observational study of African American individuals from the Jackson, Mississippi (USA), metropolitan area [34]. JHS seeks to study the causes and disparities in cardiovascular health and related phenotypes in African Americans. Data and biological materials have been collected from 5,306 participants. For a subset of the cohort, peripheral blood samples were collected at baseline and subsequently used to quantify DNA methylation using the Illumina Infinium MethylationEPIC BeadChip that covers over 850,000 CpG sties. These samples have been included in previous DNAm studies [15,35]. See Table S1 for cohort characteristics. In our analysis, we included individuals for which DNAm data, phenotypic variables, and mortality data were available (N=1,909, 62.2% women, mean (SD) of age = 56.1 (12.4) years). For 146 individuals, technical replicates were collected. We therefore divided this dataset into two samples; 1) a general cohort sample that does not include technical replicate pairs (N=1,761, 62.6% women, mean(SD) of age=56.0 (12.3)) and 2) a technical replicate sample (N=146, 57.5% women, mean (SD) of age=57.4 (14.0)). Replicate pairs represent DNAm samples that were assayed twice using the EPIC array at separate occasions but originate from the same DNA extraction sample.

Data preprocessing and normalization strategies
To perform a systematic evaluation of available data preprocessing and normalization strategies, we incorporated all methods that are available through the commonly used R packages minfi [36], wateRmelon [23], and ENmix [25]. Within the same package, we implemented all possible combinations of background correction, dye-bias correction, probe correction, and data normalizations as was feasible within the structure of the package. In total, this yielded 101 strategies to prepare DNAm data (Table S2). For each sample, raw intensity values were read from IDAT files into an RGChannelSetExtended object in the R programming environment using the read.metharray() function in minfi. Sample quality control was performed by excluding samples with more than 5% of CpG sites with a detection P-value greater than 0.05 (using the pfilter() function in the wateRmelon package) and by removing outlying samples based on a low median of chipwide (un)methylation across CpG sites (using the getQC() function in minfi). In total, 44 samples were removed. No probes were filtered out to minimize missing probes in downstream DNAm prediction analysis. Data processing and normalization were then executed in batches of 96 samples for computational efficiency. The output of each analytical pipeline was a matrix with beta values for each sample. Table S3 shows an overview of our sample quality control analysis.

DNAm-based predictors
DNAm predictor estimates were calculated using regression coefficients as reported by the corresponding study unless stated otherwise. Custom R scripts were implemented that take as input a matrix of EPIC array beta values and output predicted estimates as a linear combination of weighted CpG methylation levels. For DNAm clocks, inverse transformation was applied to calibrate the DNAm age estimates in units of years, as required by the algorithm. For instance, Horvath's epigenetic clock regressed log-linear age (that leveraged age at 20) on DNA methylation levels and required this calibration step.
Next, we briefly describe the different predictors included in our study. Table S4 presents an overview of predictor characteristics. For full details on each predictor, we refer to their corresponding studies.

DNAm clocks:
The following predictors all output a form of DNAm age and capture a different aspect of biological age depending on characteristics of their training dataset. The Hannum clock uses 71 CpG probes and was developed in a whole blood 450K DNAm dataset of 656 individuals [7]. The Horvath clock was developed using 3,931 multi-tissue and -cell type samples using both 27K and 450K array samples [6]. The Horvath clock uses 353 CpG probes that are present on both arrays. The BioAge4HAStatic clock is an extended measure of the Hannum clock and defined by forming a weighted average of Hannum's estimate with 3 cell types that are known to change with age: naïve (CD45RA+CCR7+) cytotoxic T cells, exhausted (CD28-CD45RA-) cytotoxic T cells, and plasmablasts [17]. The Weidner clock uses 3 CpG and was developed in a 27K DNAm dataset of whole blood samples from 575 individuals [11]. The Lin clock uses 99 CpG and was developed in a dataset of 450K array whole blood samples of 656 individuals [10]. The VidalBralo clock uses 8 CpG probes and was developed in a dataset of 450K array whole blood tissue of 390 individuals [13]. The Skin & Blood clock uses 391 CpG probes and was developed in a dataset of 450K and EPIC arrays of a mixture of human fibroblasts, skin tissue, buccal cells, endothelial cells, whole blood, and cord blood samples (N=896) [9]. The Zhang clock uses 514 CpG probes and was developed in a dataset of EPIC and 450K arrays of 13,566 samples. The majority of the samples were derived from whole blood with a small subsample from saliva tissue [8].
Mitotic clocks: The MiAge calculator uses 268 CpG probes and was developed on 4,020 samples of 8 cancer types using 450K DNAm arrays [19]. MiAge outputs an estimate of mitotic age (total number of lifetime cell divisions) for a given human tissue. The epiTOC calculator was developed in a 450K DNAm dataset of 650 whole blood samples. EpiTOC uses a subset of 385 Polycomb group targets promoter CpGs to predict an estimate of age acceleration in cancer. EpiTOC yields a score, denoted "pcgtAge", as the average DNAm over CpG sites, representing the age-cumulative increase in DNAm at these sites due to putative cell-replication errors [18].

Mortality risk estimators:
The Zhang mortality score is defined by a weighted average of 10 CpGs that are associated with mortality status [16]. The Zhang mortality score predictor was trained on a discovery cohort of whole blood 450K DNAm samples from 954 individuals (N=402 deceased at follow-up) and validated in a cohort of 1,000 individuals (N=231 deceased at follow-up). The second mortality estimator, Levine clock, is a predictor of "phenotypic age", which is a DNAm surrogate of the composite score based on ten mortality markers (9 clinical markers + chronological age) [37]. A training cohort of 456 whole blood samples were then used to identify 513 CpGs predictive of phenotypic age. Only probes available on the 27K, 450K, and the EPIC array platform were used in their analysis. The linear combination of the weighted 513 CpGs is called "DNAm PhenoAge". The third mortality risk estimator isGrimAge from Lu et al., which is defined by a composite score based on seven DNAm-based plasma protein markers, DNAm-based pack years of smoking, chronological age and gender [15]. GrimAge used a training dataset of whole blood samples of 1,731 individuals. The DNA methylation profiling was based on the 450K beadchip but the biomarker was trained on the CpGs present on both the 450K and the EPIC array in order to ensure compatibility for both platforms. GrimAge was calculated using a python executable that was developed by the authors of the original study, which also outputs several DNAm-based plasma protein markers, three blood cell types, and pack years of smoking (see below).
Smoking predictors: Two DNAm-based smoking predictors were included in our analysis. The Lu estimator was trained using a whole blood 450K DNAm dataset of 1,731 individuals in FHS and uses 172 CpGs for prediction, which is a component of GrimAge [15]. We estimated Lu pack years of smoking using the GrimAge python executable. The McCartney estimator was developed using EPIC DNAm data (only probes that are also present on the 450K platform) of 3,444 individuals [20]. The McCartney estimator uses 233 CpGs and outputs, similar to the Lu predictor, the number of pack years of smoking.