A Deep Survival EWAS approach estimating risk profile based on pre-diagnostic DNA methylation: An application to breast cancer time to diagnosis

Previous studies for cancer biomarker discovery based on pre-diagnostic blood DNA methylation (DNAm) profiles, either ignore the explicit modeling of the Time To Diagnosis (TTD), or provide inconsistent results. This lack of consistency is likely due to the limitations of standard EWAS approaches, that model the effect of DNAm at CpG sites on TTD independently. In this work, we aim to identify blood DNAm profiles associated with TTD, with the aim to improve the reliability of the results, as well as their biological meaningfulness. We argue that a global approach to estimate CpG sites effect profile should capture the complex (potentially non-linear) relationships interplaying between sites. To prove our concept, we develop a new Deep Learning-based approach assessing the relevance of individual CpG Islands (i.e., assigning a weight to each site) in determining TTD while modeling their combined effect in a survival analysis scenario. The algorithm combines a tailored sampling procedure with DNAm sites agglomeration, deep non-linear survival modeling and SHapley Additive exPlanations (SHAP) values estimation to aid robustness of the derived effects profile. The proposed approach deals with the common complexities arising from epidemiological studies, such as small sample size, noise, and low signal-to-noise ratio of blood-derived DNAm. We apply our approach to a prospective case-control study on breast cancer nested in the EPIC Italy cohort and we perform weighted gene-set enrichment analyses to demonstrate the biological meaningfulness of the obtained results. We compared the results of Deep Survival EWAS with those of a traditional EWAS approach, demonstrating that our method performs better than the standard approach in identifying biologically relevant pathways.

DNA methylation (DNAm) is a chemical modification that consists of the addition of a 2 methyl group via a covalent bond to the cytosine ring of DNA, in correspondence of 3 CpG sites (CpGs), and a large body of evidence has demonstrated that CpG islands 4 hypermethylation is implicated in loss of expression of a variety of critical genes in 5 cancer [1]. 6 7 These alterations can be detected both in the target tissue (e.g., cancer biopsy vs 8 tumor-free adjacent tissue) and in blood-derived DNA. DNAm dysregulation in target 9 tissue are likely the effect of the disease rather than vice versa [2], whereas DNAm 10 alterations in blood are commonly used as biomarkers of long-term exposure and insults 11 to the DNA which includes variability related to genetic predisposition or individual 12 response to risk factors. The above suggest the possibility to use blood-derived DNAm 13 profiles as new biomarkers for cancer risk stratification and possibly, early detection. 14 Investigating DNA methylation data from blood samples is of particular interest since it 15 is a convenient tissue to assay for constitutional methylation and its collection is In this work, we focus on the identification of blood DNAm profiles predictive of TTD, 24 with the aim to improve the reliability/reproducibility of the results, as well as their 25 biological meaningfulness. Indeed, previous studies based on pre-diagnostic blood 26 DNAm, either ignore the explicit modeling of TTD as in a survival analysis setting, or 27 mostly provide inconsistent results (e.g. [3] and references therein). This unsatisfactory 28 outcome may be induced by the limitations of the most traditional approaches for the 29 analysis of whole-genome DNAm data, i.e., Epigenome-Wide Association Studies 30 (EWAS). Indeed, EWAS analyses traditionally comprise multiple independent tests of 31 individual CpG sites or regions, seeking for significant associations by imposing p-value 32 thresholds corrected for multiple comparisons. by environmental exposures and lifestyle behaviors [4]. Additionally, the context of 40 pre-diagnostic blood DNAm carries further complexities due to the (iv) very low 41 signal-to-noise ratio of differential methylation, as both cases and controls are healthy at 42 the time of DNAm collection. Lastly, (v) these methods based on independent testing 43 do not account for any of the complex and potentially non-linear interactions that might 44 exist between CpG sites or the combined effect of multiple loci together on the 45 phenotype. Indeed, findings from previous studies [5], suggest the need for an 46 epigenome-wide approach, that assigns individual parameters while accounting for sites' 47 combined effect on the phenotype. We refer to this set of parameters as an effects between very large amounts of input covariates [6]. Some recent works demonstrated 60 the usefulness of AutoEncoders (AE), Variational AE (VAE) and DL models to obtain 61 DL-based EWAS (henceforth, Deep EWAS) [7][8][9], especially when coupled with 62 post-hoc Explanation Methods (EM) [6]. EM like SHapley Additive exPlanations 63 (SHAP) [10,11] try to overcome the "black box" aspect of these complex models 64 assigning a contribution score (i.e., an importance weight) to each input feature, based 65 To enhance the extraction of relevant information from blood DNAm data, we defined 149 our methodological approach, i.e. the Deep Survival EWAS algorithm, as a set of steps 150 tailored to face the aforementioned facets of these complex dataset and research settings. 151 In Fig 1, we represent the visual schema of the overall procedure.

152
In brief, as a first step, we subdivide the study population between cases (i.e.,  to infer an effect profile associated to the blood-derived DNAm CpG Islands in the 179 dataset. As many ML or DL based algorithms, our approach comprises several building 180 blocks that require specific choices in terms of hyperparameters and/or implementation 181 details, that need to be optimized to provide a robust estimation of the desired effect 182 profiles. Indeed, the better the underlying K models will perform in predicting the We started from the EPIC BC cohort, equally split between cases (i.e., patients enrolled healthy but diagnosed with BC within the study time frame) and matched controls (i.e. patients matching cases at baseline, that were not diagnosed with BC within the study time frame). (B) The first step is feature aggregation via hierarchical clustering, exploiting CpG Islands continuous β values of the population of cases to infer the J clusters of features. The same clustering structure is then applied to both cases and controls, grouping their CpG Islands accordingly. (C) The cases population is exploited to generate K independent and randomly split training and test sets, each of them with 80% patients in training set and 20% patients in the test set. (D) Each of the K splits goes through step D independently. In particular, the k-th training set is used to train a Deep Survival Model, that then is used to estimate SHAP weights profiles (w k ) on the k-th test set using the controls' population as background data. (E) After generating independently K sets of weights profiles, they are aggregated to obtain the final estimation of the effects profile for BC TTD.
survival outcome, the more meaningful the feature importance weights derived by 184 SHAP will be. Concurrently, a stable feature importance ranking suggests that the K  This island belongs to the cluster that is aggregated into feature F 120 . The plot reports the Log-Rank test p-value for the difference between the two groups; the lower part of the plot reports the count of subjects in High DNAm and Low DNAm populations for CpG Island 9:34518796-34619343, according to TTD.
Section, whereas complete results are available in Supporting Information, S1 Table. 198 The final results presented here, corresponding to the chosen best Deep Survival EWAS 199 The boxplot shows a decreasing trend of the feature value with increasing TTD. Note 214 that F 120 groups 20 CpG Islands (cf.  Table 1. CpG Islands in Feature 120. List of CpG Islands agglomerated in F 120 , therefore assigned to the highest effect weight. As a further validation, for each CpG Island pertaining to F 120 , we extracted two 217 sub-population of cases: the hypermethylated population, composed of patients with β 218 value above the 75 th percentile for that CpG island, and the hypomethylated population, 219 composed of patients with values below the 25th percentile. Then, we compared the risk 220 profiles of the two groups via Kaplan-Meyer (KM) test. In Fig 3,  F 120 are reported in S1 File). As expected, modeling TTD of BC cases only, both KM 225 tend to 0 as t increases. However, it is clear from KM and test results how methylation 226 on those sites is associated with TTD for those patients. It is crucial to note that,  analysis) is that of the feature F j the CpG site belongs to. In Fig. 4 we present the 242 KEGG pathways enriched according to the WKS procedure. We found 12 pathways 243 with empirical p-value (1,000 permutations) lower than 0.05, being 'Pathways in Cancer' 244 the most significant (empirical p < 0.0001). Interestingly, all the identified pathways 245 were previously described as dysregulated in breast cancer, including 'Human Papilloma 246 virus infection' [19] and 'Epstein-bar virus infection' [20] in addition to well-known BC 247 related pathway like 'Breast Cancer', 'PI3K/Akt/mTOR signalling pathway', 'Calcium 248 Signaling pathway' and 'Mineral absorption' [21,22], and some cancer generic pathways 249 like 'Signaling pathways regulating pluripotency of stem cells', 'Proteoglycans in cancer', 250 'Neuroactive ligand-receptor interaction', and 'Cytokine-cytokine receptor interaction'. 251 As mentioned, the inference of feature importance, and the derived effects profile, can 252 be influenced by the chosen background.  CoxPH KT-stability, the weights rank was derived from the regression parameters (i. European study on diet and cancer and has been previously described elsewhere [25].

427
The Italian component of EPIC (EPIC-IT) [26]  pre-processing and sample filtering can be found in S1 Appendix. previously described association of the total number of DNAm outliers with BC risk [5]. 452 After filtering on EZH2 and SUZ12 CpG sites, we decided to perform an additional The first step of our approach comprises a hierarchical feature clustering that is 474 spatially independent. This technique is similar to a hierarchical agglomerative 475 clustering procedure, but recursively merges features instead of samples.

476
Specifically, given the initial input data X (cases) ∈ R N ×Q , where N is the total number of cases and Q the total number of CpG Islands, to identify J clusters of CpG Islands we exploited the Euclidean Distance with Ward linkage. Then, we computed the representative value of the j-th agglomerated feature as Where C j is the j-th cluster and x  input to the network for patient i is its baseline data in terms of DNAm features (F (i) ), 487 while the output h θ (F (i) ) is a single node with a linear activation which estimates its 488 log-risk function. Let T be the times to disease and E the event indicator, the objective 489 function to optimize becomes:  data) repeated on 10 random data splits, comparing the average results (cf. S1 Table). 587 As an additional decision support resource, we compared the distributions of CpG Controls, (c) All controls, (d) All controls with cases). The first column reports the 620 KEGG pathway, followed by pathway code and empirical p-value (last column).