The Multi-State Epigenetic Pacemaker enables the identification of combinations of factors that influence DNA methylation

Epigenetic clocks, DNA methylation-based predictive models of chronological age, are often used to study aging-associated biology. Despite their widespread use, these methods do not account for other factors that also contribute to the variability of DNA methylation data. For example, many CpG sites show strong sex-specific or cell-type-specific patterns that likely impact the predictions of epigenetic age. To overcome these limitations, we developed a multidimensional extension of the Epigenetic Pacemaker, the Multi-State Epigenetic Pacemaker (MSEPM). We show that the MSEPM is capable of accurately modeling multiple methylation-associated factors simultaneously, while also providing site-specific models that describe the per-site relationship between methylation and these factors. We applied the MSEPM to a large aggregate cohort of blood methylation data to construct models of the effects of age, sex and cell type heterogeneity on DNA methylation. We found that these models capture a large fraction of the variability at thousands of DNA methylation sites. Moreover, we found modeled sites that are primarily affected by aging and no other factors. Among these, those that lose methylation over time are enriched for CTCF transcription factor ChIP peaks, while those that gain methylation over time are enriched for REST transcription factor ChIP peaks. Both transcription factors are associated with transcriptional maintenance, which suggests a general dysregulation of transcription with age that is not impacted by sex or cell type heterogeneity. In conclusion, the MSEPM is capable of accurately modeling multiple methylation-associated factors, and the models produced can illuminate site-specific combinations of factors that affect methylation dynamics.


DNA methylation, the addition of a methyl group to the fifth carbon of the cytosine pyrimidine ring, is associated with the topological organization of the cellular genome, gene expression and the state of a cell. Within a population of cells the methylation pattern at certain sites can change predictably with the age of the individual from which the cells are drawn. This predictable nature of DNA methylation has led to the development of accurate DNA methylation based predictive models for age and health, termed epigenetic clocks. The difference between the predicted and the expected epigenetic age given an individual's chronological age has been interpreted as a measure [...] net penalty, λ1 and λ2. Methylation sites that increase model error and are influenced by other relevant factors, such as smoking or obesity, may be discarded during model fitting, thus limiting the ability of this approach to account for the effects of these extraneous factors on epigenetic aging.

As an alternative to penalized regression based methods we previously developed an evolutionary based model for epigenetic dynamics, the Epigenetic Pacemaker (EPM) [13,14]. The EPM attempts to minimize the difference between observed and predicted methylation values amongst a collection of sites through the implementation of a conditional expectation maximization algorithm [15]. Under the EPM the observed methylation status of a collection of sites is modeled linearly with respect to an input factor of interest, such as age. A hidden epigenetic state that is related to the initial factor, but not necessarily linearly, is learned through the course of model fitting. The EPM can capture the non-linear relationship between methylation and age [16] and outputs an interpretable model for each site.
However, both the EPM and regression based methods share the same limitation: they are restricted to a single trait that is predicted by, or used to model, observed methylation patterns. In reality, the observed methylation landscape is likely impacted by a variety of factors that act simultaneously to produce the observed methylome of an individual.
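
The single-state idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the published EPM implementation: it uses plain alternating least squares as a stand-in for the conditional expectation maximization algorithm, and the function names, simulated site parameters and noise level are all hypothetical. Site models are fit with the states held fixed, then each individual's state is updated with the site models held fixed, starting from the observed trait.

```python
import random


def fit_site(states, meth):
    """Least-squares rate and intercept for one site, with the states held fixed."""
    n = len(states)
    mean_s = sum(states) / n
    mean_m = sum(meth) / n
    var_s = sum((s - mean_s) ** 2 for s in states)
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(states, meth))
    rate = cov / var_s
    return rate, mean_m - rate * mean_s


def update_state(models, meth_j):
    """State minimizing the squared error for one individual, with site models fixed."""
    num = sum(r * (m - b) for (r, b), m in zip(models, meth_j))
    den = sum(r * r for r, _ in models)
    return num / den


def total_rss(models, states, meth):
    """Residual sum of squares over all sites (rows) and individuals (columns)."""
    return sum(
        (m - (r * s + b)) ** 2
        for (r, b), row in zip(models, meth)
        for s, m in zip(states, row)
    )


def fit_epm(meth, trait, n_iter=20):
    """Alternate site-model fits and state updates, seeding the states with the trait."""
    states = list(trait)
    for _ in range(n_iter):
        models = [fit_site(states, row) for row in meth]
        states = [update_state(models, col) for col in zip(*meth)]
    return [fit_site(states, row) for row in meth], states


# Hypothetical simulation: methylation is linear in sqrt(age), not in age itself.
random.seed(0)
ages = [random.uniform(1, 100) for _ in range(200)]
true_sites = [(random.uniform(0.01, 0.05), random.uniform(0.1, 0.4)) for _ in range(30)]
meth = [[r * a ** 0.5 + b + random.gauss(0, 0.01) for a in ages] for r, b in true_sites]

age_models = [fit_site(ages, row) for row in meth]
rss_age = total_rss(age_models, ages, meth)
epm_models, epm_states = fit_epm(meth, ages)
rss_epm = total_rss(epm_models, epm_states, meth)
print(f"RSS linear in age: {rss_age:.3f}; RSS with learned state: {rss_epm:.3f}")
```

Because each step exactly minimizes the error over one block of parameters, the residual sum of squares cannot increase between iterations. The learned state is only determined up to a shift and scale, so it should be compared with the underlying factor by correlation rather than by absolute value.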
[...] to an individual, termed epigenetic factors. Under this model epigenetic factors are related to observable individual factors, such as chronological age, sex and cell types, but may be transformed relative to observable factors. The epigenetic age factor, for example, often has a non-linear relationship with the observed age [16]. The MSEPM learns the appropriate transformation during model fitting to describe the observed methylation status linearly in terms of the epigenetic age factor (and not linearly with age). Given a site i and individual j the observed methylation status can be modeled [...] j,k, where $\gamma_k$ is the transformation between a factor of magnitude $q_{n,j}$ and the epigenetic factor. In practice the value of $f_{k,j}$ is often unknown and the association between methylation status and $f_{k,j}$ is inferred through $q_{n,j}$. We simulated individuals whose methylation is determined by four factors and their associated epigenetic factors: a uniform factor approximating age with a non-linear association with methylation status ($q \sim U(0, 100)$, $s_{Age} = q^{0.5}$, Figure 1A [...]

[...] previously described [34,35]. The aggregate data spanned a wide age range (0.0-99.0 years, Figure 3A), contained more predicted females (n = 3392) than males (n = 2295, Figure 3B) and produced reasonable cell type abundance estimates (Figure 3C). We trained MSEPM models using data assembled from four GEO series [20,22,29,36] [...] Figure 5A).
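
To make the simulated age factor concrete, the following sketch draws an observed factor q from U(0, 100), sets the epigenetic factor to q^0.5 as in the simulation described above, and generates methylation values that are linear in that epigenetic factor. The site rates, intercepts and noise level are hypothetical. For illustration only, it assumes that candidate transformations are powers of q and scans a small grid of exponents; this is not a description of how the MSEPM learns the transformation.

```python
import random


def linear_rss(x, y):
    """Residual sum of squares of the least-squares line y ~ a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


random.seed(1)
q = [random.uniform(0, 100) for _ in range(300)]  # observed factor, q ~ U(0, 100)
s_age = [qi ** 0.5 for qi in q]                   # epigenetic factor, s_Age = q^0.5

# Hypothetical site parameters (rate, intercept) and noise, chosen only for illustration.
sites = [(0.04, 0.2), (-0.03, 0.8), (0.05, 0.1)]
meth = [[r * s + b + random.gauss(0, 0.005) for s in s_age] for r, b in sites]

# Assumption: candidate transformations are powers of q; scan them and keep the best fit.
candidates = [0.25, 0.5, 0.75, 1.0]
rss_by_exponent = {
    g: sum(linear_rss([qi ** g for qi in q], row) for row in meth) for g in candidates
}
best_exponent = min(rss_by_exponent, key=rss_by_exponent.get)
print(rss_by_exponent)
print("best exponent:", best_exponent)
```

The comparison between the exponent 1.0 (methylation modeled linearly in the observed factor) and the exponent 0.5 (methylation modeled linearly in the epigenetic factor) is the point of the example: the same site-level linear model fits much better once the factor is transformed.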

The site clusters largely conform to underlying biological expectations. Cluster one contains sites that are wholly associated with sex status and localized to the X chromosome (Supp. Table 1) and is enriched for peaks of transcription factors associated with sex specific regulation such as MAZ [40]. Clusters nine and ten contain sites whose methylation status is largely driven by CT-PC1, and are enriched for tran[...] is known to increase at sites where methylation is lost during aging [43]. Cluster four is enriched for STAT3 whose activation during exercise is age dependent [44,45]. Cluster seven is associated with the accumulation of methylation with age and is enriched [...]ulation of TGF-β has been linked to reduced skeletal muscle regeneration [46,47] and SMAD4 polymorphisms are associated with longevity [48]. REST is a transcriptional repressor of neuron specific genes in non-neuronal cells [49,50]. REST expression is upregulated in aged prefrontal cortex tissue and the absence of REST expression is associated with cognitive impairment [51] and cellular senescence in neurons [52].

- The k untransformed input traits are used as the initial guess of $S_{j,k}$ for the first model iteration.

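• Update $S_{j,k}$ to minimize the residual sum of squares (RSS) cost function, the squared difference between the observed methylation values and the predictions $R_i S_j$, using gradient descent with the site parameters $r_i$ held fixed.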
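
The state update step can be sketched as follows for a single individual with two epigenetic states. This is a minimal illustration under stated assumptions, not the MSEPM code: the site parameters, the fixed learning rate, the number of steps and the noise-free observations are all hypothetical, and each site is assumed to have an intercept plus one rate per state.

```python
def predict(site_params, state):
    """Predicted methylation per site: intercept plus the rate-weighted sum of states."""
    return [p[0] + sum(r * s for r, s in zip(p[1:], state)) for p in site_params]


def rss(site_params, state, observed):
    """Residual sum of squares for one individual across all sites."""
    return sum((m - p) ** 2 for m, p in zip(observed, predict(site_params, state)))


def update_state(site_params, state, observed, lr=0.1, n_steps=500):
    """Gradient descent on the states of one individual; site parameters stay fixed."""
    state = list(state)
    for _ in range(n_steps):
        resid = [m - p for m, p in zip(observed, predict(site_params, state))]
        # d RSS / d S_k = -2 * sum_i r_{i,k} * residual_i
        grad = [
            -2.0 * sum(p[k + 1] * e for p, e in zip(site_params, resid))
            for k in range(len(state))
        ]
        state = [s - lr * g for s, g in zip(state, grad)]
    return state


# Hypothetical site parameters: (intercept, rate for factor 1, rate for factor 2).
site_params = [(0.2, 0.5, 0.1), (0.5, -0.3, 0.4), (0.6, 0.2, -0.6), (0.1, 0.0, 0.3)]
true_state = [0.3, 1.0]
observed = predict(site_params, true_state)  # noise-free methylation values

initial_state = [0.9, 1.0]                   # untransformed traits as the first guess
fitted_state = update_state(site_params, initial_state, observed)
print(rss(site_params, initial_state, observed), rss(site_params, fitted_state, observed))
print(fitted_state)
```

A fixed learning rate only converges when it is small relative to the curvature of the cost function, which depends on the magnitude of the site rates; the value used here was chosen to suit these hypothetical rates and would need tuning for real data.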

The data supporting these findings are openly available at GEO under the series