Epigenetic scores for the circulating proteome replicate protein- disease predictions as tools for biomarker discovery

Polygenic scores capture the genetic predisposition to complex disease and are powerful tools for risk stratification. However, such diseases are also largely influenced by environmental insults and lifestyle factors, which are thought to be captured by epigenetic modifications. DNA methylation encodes information on the epigenetic landscape of an individual, regulates gene transcription, and can be reflected in the variability of the proteome. Protein biomarkers that predict or correlate with disease onset have been identified across many age-related morbidities; however, the extent of epigenetic influence in these pathways is often unclear. Here, we report a comprehensive epigenome-wide study of the circulating proteome. Using data from four independent cohort studies, we train and test epigenetic scores (EpiScores) for 953 plasma proteins, identifying 109 robust signatures that explain between 1% and 58% of the variance after adjusting for genetic effects. These EpiScores are then linked to the incidence of 12 leading morbidities over 14 years of follow up in the Generation Scotland cohort (n=9,537). After adjustment for common risk factors, 137 EpiScore - incident disease associations are revealed, several of which are confirmed by previously identified associations with measured protein levels. Notably, 21 EpiScores that associate with incident diabetes replicate previous protein associations with incident (n=12 proteins) and prevalent (n=21 proteins) diabetes. Complementary to polygenic scores, these epigenetic scores for protein levels are therefore likely to capture the environmental components of disease risk and can be a valuable resource for disease prediction, stratification and biomarker discovery.


Introduction
Chronic morbidities place longstanding burdens on our health as we age. Stratifying an individual's risk prior to overt symptom presentation is therefore critical 1 . Complex diseases, such as type 2 diabetes, cancer, cardiovascular disease and Alzheimer's disease, are known to be partially driven by genetic factors that underpin the risk of onset [2][3][4] . Polygenic risk scores (PRS) can be useful tools to stratify disease risk [5][6][7] . However, PRS do not assess the environmental part of this risk. Epigenetic modifications are known to associate with complex disease 8,9 . DNA methylation can reflect genetic and phenotypic exposures and, in many cases, may also regulate gene expression in response to physiological stress 10 . Blood-based methylation signatures have been shown to strongly predict allcause mortality and epigenetic scores have been developed as tools to predict smoking status and other lifestyle-related exposures [11][12][13] . Advances in genetic and epigenetic studies of the plasma proteome have uncovered pathways through which DNA methylation may influence both protein expression and disease outcomes 4,10,[13][14][15][16][17] . Low-grade inflammation, which is thought to exacerbate many age-related morbidities, is particularly well-captured through DNAm studies of the plasma proteome 18  Epigenetic predictors have utilised DNAm from the blood to estimate a person's 'biological age' 19 , measure their exposure to lifestyle and environmental exposures 12,20,21 and predict circulating levels of inflammatory proteins 22,23 . A leading epigenetic predictor of biological aging, the GrimAge epigenetic clock, incorporates methylation scores for seven proteins along with smoking and chronological age, and is associated with numerous incident disease outcomes 19,24 . However, focusing on measurements of individual proteins, instead of composite measures may offer greater insights into specific DNAm-protein pathways pertinent in the earliest stages of disease pathogenesis. DNAm scores for Interleukin-6 and C-Reactive protein associate with a range of phenotypes independently of measured protein levels, show more stable longitudinal trajectories than repeated protein measurements, and, in some cases, outperform blood-based proteomic associations with brain morphology 23,25 .
Here, we report a comprehensive association study of blood-based DNAm with proteomics and disease (Figure 1). We train epigenetic scores -referred to as EpiScores -for 953 plasma proteins (with sample size ranging from 725 -944 individuals) and validate them using two independent cohorts with 778 and 162 participants; Demographic information for all cohorts is described in Supplementary Table 1. We regress out genetic effects (protein quantitative trait loci; pQTLs) from the protein levels prior to generating the EpiScores so that they reflect environmental and lifestyle contributions to individual differences. Finally, we examine whether the most robust predictors (n=109 EpiScores) associate with the incidence of 12 major morbidities, over a follow up period of up to 14 years in the Generation Scotland cohort (n = 9,537). We control for common risk factors for disease and assess the replication of EpiScore-disease relationships against previously reported protein-disease associations.

Selecting the most robust EpiScores for protein levels
To generate epigenetic scores for a comprehensive set of plasma proteins, we ran elastic net penalised regression models using protein measurements from the SOMAscan (aptamer-based) and Olink (antibody-based) platforms. We used two cohorts: the German population study KORA (n=944, with a subset of 793 SOMAscan proteins present across train and test cohorts) and the Lothian Birth Cohort 1936 (LBC1936) study (between 725 and 875 individuals in the training cohort, with a total of 160 Olink neurology and inflammatory panel proteins). Demographic information is available for all cohorts in Supplementary Table 1.
Prior to running the elastic net models, we rank-based inverse normalised protein levels and adjusted for age, sex, cohort-specific variables and, where present, cis and trans pQTL effects identified from previous analyses 17, 26 Table 3). Independent test set data were not available for four Olink proteins. However, they were included based on their performance (r > 0.1 and P < 0.05) in a holdout sample of 150 individuals who were left out of the training set. We then retrained these four predictors on the full training sample. A total of 109 EpiScores (84 SOMAscan-based and 25 Olink-based) were brought forward to EpiScore-disease analyses (Supplementary Table 4 28 for each CpG site having genome-wide significance (P < 3.6 x10 -8 ) 29 .

EpiScore-disease associations in Generation Scotland
The Generation Scotland dataset contains extensive electronic health data from GP and hospital records available as well as DNA methylation data for 9,537 individuals. This makes it uniquely positioned to test whether EpiScore signals can predict disease onset. As the 109 EpiScores were trained on protein levels after adjustment for known pQTLs, they are likely to reflect only nongenetic contributions to the proteome. We ran nested mixed effects Cox proportional hazards models The Cox proportional hazard assumption dictates that hazard ratios for EpiScore -disease associations should remain constant over time. We correlated the Schoenfeld residuals from the models with time to test this. Two associations in the basic model adjusting for age and sex failed to satisfy the global assumption (across all covariates) and were excluded. There were 294 remaining EpiScore-disease associations with a False Discovery Rate (FDR)-adjusted P < 0.05 in the basic model. After further adjustment for common risk factor covariates (smoking, social deprivation status, educational attainment, body mass index (BMI) and alcohol consumption), 137 of the 294 EpiScore-disease associations from the basic model had P < 0.05 in the fully-adjusted model (Supplementary Tables 8-9). Eleven of the 137 fully-adjusted associations failed the Cox proportional hazards assumption for the EpiScore variable (P < 0.05 for the association between the Schoenfeld residuals and time; Supplementary Table 10). When we restricted the time-toevent/censor period by possible years of follow-up, there were minimal differences in the EpiScoredisease hazard ratios between follow-up periods that did not violate the assumption and those that did (Supplementary Table 11). The 137 associations were therefore retained as the primary results.
The 137 associations found in the fully-adjusted model comprised 78 unique EpiScores that were related to the incidence of 11 of the 12 morbidities studied. Diabetes and chronic obstructive pulmonary disease (COPD) had the greatest number of associations, with 33 and 41, respectively. Figure 3 presents the EpiScore-disease relationships for COPD and the remaining nine morbidities: stroke, lung cancer, ischaemic heart disease, inflammatory bowel disease, rheumatoid arthritis, depression, bowel cancer, pain and Alzheimer's dementia. Of the EpiScore signatures contributing to each trait, there were between one and four independent signals, as determined by components with eigenvalues > 1 in principal components analyses (Supplementary Figures 2-4). There were 13 EpiScores that associated with the onset of three or more morbidities. Figure 4 presents relationships for these EpiScores in the fully-adjusted Cox model results. Of note is the EpiScore for Complement 5 (C5), which associated with five outcomes: stroke, diabetes, ischaemic heart disease, rheumatoid arthritis and COPD.
Of the 29 SOMAscan-derived EpiScore associations with incident diabetes, 21 replicated previously reported protein associations 30,31 with incident or prevalent diabetes in one or more cohorts ( Figure   5 and Supplementary Table 12).

Immune cell and GrimAge sensitivity analyses
Correlations of the 109 EpiScores with covariates suggested interlinked relationships with both immune cells and GrimAge acceleration (Supplementary Figure 5). These covariates were therefore added incrementally to the fully-adjusted Cox models (Figure 2). There were 117 associations that remained statistically significant (FDR P < 0.05 in the basic model and P < 0.05 in the fully-adjusted model) after adjustment for immune cell proportions, of which 89 remained significant when GrimAge acceleration scores were added to this model (Supplementary Table 9 and Supplementary Figure 6). In a further sensitivity analysis, relationships between estimated white blood cell proportions and incident diseases were assessed in the Cox model structure independently of EpiScores. Of the 60 possible relationships between WBC measures and the morbidities assessed, four were statistically significant (FDR-adjusted P < 0.05) in the basic model and remained significant with P < 0.05 in the fully-adjusted model (Supplementary Table 13). A higher proportion of Natural Killer cells was linked to decreased risk of incident COPD, rheumatoid arthritis, diabetes and pain.

Discussion
Here, we report a comprehensive DNA methylation scoring study of 953 circulating proteins. We define 109 robust EpiScores for plasma protein levels that are independent of pQTL effects. By projecting these EpiScores into a large cohort with extant data linkage, we show that 78 EpiScores associate with the incidence of 11 leading causes of morbidity (137 EpiScore -disease associations in total). Finally, we describe EpiScore -disease associations that replicate previously measured protein -disease relationships. The bulk of the EpiScore-disease associations are independent of common risk factors, differences in immune cell composition and GrimAge acceleration. EpiScores therefore provide a deeper view into the methylation-proteomic signature of an individual that may uncover early pathways associated with morbidity.
Whereas polygenic scores capture genetic predisposition to disease, epigenetic scores capture the body's response to a range of environmental factors 32 . Our replication of previously identified protein -diabetes relationships 30,31 suggests that epigenetic scores can identify disease-protein pathways in the absence of direct protein measurements, most likely because the EpiScores capture relevant protein regulatory patterns that are determined by the diseases state. The two studies used for replication represent the largest candidate protein characterisations of diabetes to date and the top markers identified included aminoacylase-1 (ACY-1), sex hormone binding globulin (SHBG), growth hormone receptor (GHR) and Insulin-like growth factor-binding protein 2 (IGFBP-2) 30,31 .
We replicate the associations and direction of effect observed with these measured proteins, as well as several other top markers reported in the studies. There is a growing body of evidence suggesting that type 2 diabetes is mediated by a complex interaction between genetic and epigenetic regulators 8,33,34 . As proteins such as ACY-1 and GHR are thought to influence a range of diabetes-associated metabolic mechanisms 35,36 , EpiScores for these proteins may provide an important bridge between epigenetic influences on protein levels and the onset of diabetes. This may facilitate early risk tracking for individuals in the absence of proteomic measurement.
With modest test set performances (for example, SHBG r = 0.18 and ACY-1 r = 0.25), it is perhaps surprising that such strong replication is observed in place of the measured protein. Nonetheless, DNA methylation scores for CRP and IL6 have been shown to perform modestly in test sets (r ~ 0.2, equivalent to ~ 4% explained variance in protein level), but augment and often outperform the measured protein related to a range of phenotypes 22,23 . Though not directly comparable to proteinspecific scores, polygenic scores for disease show utility in risk prediction and stratification with similar performance thresholds to our EpiScores in independent test sets 37,38 . As EpiScores capture a range of influences on DNA methylation, they may therefore highlight a different, but complementary signal to the in vivo protein that is important to disease prediction. In future work, epigenetic protein score profiles could be created for specific diseases; the most extreme ranges of which could be used to stratify risk, similar to recent polygenic scoring approaches 39,40 . Our framework substantially expands the existing repertoire of methylomic predictors for protein levels.
The EpiScores we provide may therefore also augment existing composite epigenetic scores, given that many of the observed associations were independent of DNAm GrimAge.
Thirteen EpiScores were associated with multiple outcomes and the EpiScore for Complement Component 5 (C5) at baseline was associated with the onset of five morbidities, the highest number for any EpiScore. EpiScores that occupy central hubs in the disease-prediction framework may provide evidence of early, upstream pathways acting across morbidities that could be targeted in preventative approaches. C5 is cleaved into C5a and C5b as part of the immune response 41 and elevated levels of these peptides have been associated with severe inflammatory, autoimmune and neurodegenerative states 424344 . A range of C5-targetting therapeutic approaches are currently in development [44][45][46][47][48][49] . EpiScores for acute inflammatory markers have been shown to offer stable longitudinal measurements of the inflammatory state of an individual, which is likely due to DNA methylation reflecting a more consistent profile of stress in the body than protein measurements 23,25 .
Given that there are issues surrounding complement peptide assay sensitivity 50 , our EpiScores may therefore add value, as many are trained on proteins that occupy roles in the innate and adaptive immune response. A multi-protein signature of the inflammasome of an individual is likely to have important applications for understanding and stratifying risk for multiple immune-mediated morbidities.
For conditions such as Alzheimer's dementia, for which the early molecular regulators causing protein aggregates in the brain are still relatively unknown, the acid sphingomyelinase (ASM) EpiScore association we identified highlights disease pathways that may occur years prior to diagnosis. Acid sphingomyelinase is encoded by the SMPD1 gene and converts sphingomyelin to ceramide as part of a lysosomal pathway 51 ; the association we describe supports previous work characterising aberrations to lysosomal pathways in the early stages of Alzheimer's disease [52][53][54] . In murine models of Alzheimer's disease pathology, excessive lysosomal ASM disrupts autophagic protein degradation and associates with accumulation of amyloid-beta and other cytotoxic proteins 52,54 . ASM has therefore been discussed as a therapeutic candidate for aging and Alzheimer's disease 54 . Several of the EpiScore protein pathways we identify are already being pursued as therapeutic candidates in various stages of development. For example, NEP (an Olink-derived EpiScore) associated with incident diabetes; a recent trial demonstrated a reduction in the rate of decline in renal function in individuals that had type 2 diabetes and heart failure resulting from the inhibition of NEP 55,56 . NEP inhibition has also been shown to associate with improved insulin sensitivity in those with obesity and hypertension 57 , which has led to this pathway being proposed as a candidate for type 2 diabetes therapy 58 . EpiScore pathways may therefore help to target therapeutic investigations.
This study has a number of limitations. First, although pQTL effects from GWAS analyses in the samples were regressed out of protein levels prior to training, there may be rare cases of unknown pQTLs that can still influence the predictors. Second, due to missing risk factor covariate information, cases were excluded from the fully-adjusted models; however, effect sizes for the 137 reported associations showed minimal attenuation in the fully-adjusted models. This indicated that missing covariate data had only a marginal effect on the findings. Third, the epitope nature of the protein measurement in the SOMAscan panel may incur probe cross-reactivity and non-specific binding 16 . Fourth, the associations present between EpiScore measures and disease incidence may have been influenced by external factors such as prescription medications for comorbid conditions and comorbid disease prevalence, which should be investigated in future analyses. Fifth, due to known heterogeneity between blood and brain methylation 59 , it is unlikely that EpiScores capture the minutiae of processes in tissues with high proteomic specificity, such as the brain.
Our findings suggest that DNA methylation phenotyping approaches and data linkage to electronic health records in large, population-based studies have the potential to (1) Capture the interindividual, environmental influences on protein levels; (2) Replicate disease associations in the absence of the measured protein; and (3) Augment risk prediction many years prior to morbidity onset. Although the EpiScore-disease associations replicated some known protein-disease findings, validation of other EpiScore-disease findings are still required. Specifically, this should entail the direct measurement of proteins, as inference from EpiScores alone, while useful for disease risk stratification, is not sufficient to determine pathways and mechanisms. Though it was only possible to assess replication for diabetes associations, several of the markers we identify for COPD and IHD have also been identified in previous studies using measured proteins 60,61 . Strengths of this study are the inclusion of cohorts that ranged in age and consisted of both Scottish and German ancestries. As EpiScores trained in these cohorts performed well when applied to cohorts of differing distributions, it is likely that our scores will generalise to other cohorts comprising healthy ageing individuals.
However, these are both cohorts with European ancestries and there is evidence that performance of polygenic scores is poorer when applied to non-European ancestries 40 . It is therefore critical to expand the diversity of training samples in future studies. Another strength is the extensive data linkage capacity in Generation Scotland that allowed us to investigate time-to-event for several common disease outcomes. Given the increasingly widespread assessment of DNAm in cohort studies 62,63 , EpiScores offer an affordable and consistent (i.e. array-based) way to measure multiple biomarkers pertinent to ageing and disease. They may also provide important insights into quantifying the impact of environmental insults 21 . We provide the weights for our EpiScores, such that any study can project them into cohorts with Illumina DNA methylation data.

___
In conclusion, we show that EpiScores for circulating protein levels have the capacity to replicate protein-disease associations and identify novel biomarkers for disease, up to 14 years prior to diagnosis. EpiScore weights are adjusted for known genetic effects and are provided for future studies to utilise. This information is likely to be important in risk stratification and prevention of age-related morbidities.

The KORA sample population
The KORA F4 study includes 3,080 participants who reside in Southern Germany. Individuals were between 32 and 81 years of age when recruited to the study from 2006 and 2008. In the current study, there were 944 individuals with methylation, proteomics and genetic data available. The Infinium HumanMethylation450 BeadChip platform was used to generate DNA methylation data for these individuals. The Affymetrix Axiom array was used to generate genotyping data and the SOMAscan platform was used to generate proteomic measurements in the sample.

DNA methylation in KORA
Methylation data were generated for 1,814 individuals in the KORA F4 study 64 . There were 944 participants that also had protein and genotype measurements available. Briefly, during preprocessing measurements from 65 probes targeting SNPs were excluded and the R package minfi was used to perform background correction. Samples with a detection rate of less than 95% were excluded. Next, the minfi R package was used to perform normalization on the intensity methylation measures 65 , with a method consistent with the Lumi:QN +BMIQ pipeline. White blood cell proportions were estimated through the Houseman method, which utilised 473 of the 500 cellspecific CpG sites that were available on the methylation array. Of the 485,512 CpG sites available, those present on sex chromosomes (11,231 for X and 416 for Y) were excluded, along with 35 sites that had fewer than 100 measures available and 2,993 CpG sites which had non-cg status. A total of 470,837 sites were therefore available for analyses.

Proteomics in KORA
The SOMAscan platform was used to quantify protein levels in undepleted plasma. SOMAscan allows for simultaneous measurement of proteins through an aptamer-based assay 66 . The analysis pipeline was performed by SomaLogic Inc. (Boulder Colorado, USA) and 1129 SOMAmer probes were obtained using the SOMAscan assay V3.2 for samples included in the present study, which is described elsewhere 17 . Briefly, the undepleted EDTA-plasma samples were divided into 0.05, 1 and 40% dilutions and allowed to bind to their respective target SOMAmers during an incubation step in 96-well plates. Following several washes, the bead-bound target proteins underwent biotinylation and flouresence-labelled SOMAmers were photo cleaved, such that these complexes were separated from the bead and pooled. Customised arrays with oligonucleotides that were complementary to the SOMAmers were then used to quantify SOMAmer levels once they were captured on streptavidin beads. The raw intensities quantified through hybridisation between the SOMAmer and oligonucleotides produced a proxy for protein concentration. Several quality controls steps were employed to account for hybridization normalization, signal calibration and median signal normalization to control for inter-plate variation. These used standard samples which were included on each processing plate as comparative measures.
Of the 1,000 samples provided for analysis, two samples were excluded due to errors in bio-bank sampling and one based on quality control measures. Of the 997 samples available, there were 944 individuals with methylation and genotypic data. This cohort were therefore included in the current analyses. Of the 1,129 probes available, five failed the QC, leaving a total of 1,124 probes for the subsequent analysis. Protein measurements were transformed by rank-based inverse normalisation and regressed onto age, sex, known pQTLs and 20 genetic principal components of ancestry derived from the Affymetrix Axiom Array to control for population structure. pQTLs for each protein were taken from a previous GWAS in the sample 17 .

The LBC1936 and LBC1921 sample populations
The

DNAm in LBC1936 and LBC1921
DNA from whole blood was assessed using the Illumina 450 K methylation array at the Edinburgh Clinical Research Facility. Details of quality control have been described elsewhere 69,70 . Raw intensity data were background-corrected and normalised using internal controls. Following background correction, manual inspection resulted in the removal of low quality samples that presented issues related to bisulphite conversion, staining signal, inadequate hybridisation or nucleotide extension. Probes with low detection rate <95% at P < 0.01 were removed. Samples with low call rates (samples with <450,000 probes detected at P < 0.01) were also removed.
Additionally, samples were removed if they had a poor match between genotype and SNP control probes, or incorrect DNA methylation-predicted sex.

Proteomics in LBC1936 and LBC1921
To generate Olink® neurology panel protein measures, plasma was extracted from 816 blood samples collected in citrate tubes at mean age 72.

Generation Scotland and STRADL sample populations
Generation Scotland: the Scottish Family Health Study (GS) is a large, family-structured, populationbased cohort study of >24,000 individuals from across Scotland (mean age 48 years) 71 . Recruitment took place between 2006 and 2011 with a clinical visit where detailed health, cognitive, and lifestyle information was collected along with biological samples (blood, urine, saliva). In GS, there were 9,537 individuals with DNAm and phenotypic information available for this study. The Stratifying Resilience and Depression Longitudinally (STRADL) cohort is a subset of 1,188 individuals from the GS cohort who undertook additional assessments approximately five years after the study baseline 72 . Both DNAm and proteomic data (measured using the SOMAscan platform) were available in 778 individuals from the STRADL cohort.

DNA methylation in Generation Scotland and STRADL
In the main GS cohort, blood-based DNA methylation has been generated in two separate sets using the Illumina EPIC array. Prior to quality control, Set 1 comprised 5,190 related individuals whereas Set 2 comprised 4,583 individuals, unrelated to each other and to those in Set 1. Quality control details have been reported previously 73,74 . Briefly, probes were removed based on (i) outliers from visual inspection of the log median intensity of the methylated versus unmethylated signal per array, (ii) a bead count <3 in more than 5% of samples, and (iii) Ten saliva samples were excluded from Set 1, along with three individuals who had answered "yes" to all self-reported health conditions. One person with suspected XXY genotype and seven genetic outliers were also removed 75 . The quality-controlled dataset comprised 9,537 individuals (n Set1 =5,087, n Set2 =4,450). The same steps were also applied to process the DNAm measurements in the STRADL subset of GS.

Proteomics in STRADL
Measurements for 4,236 proteins in 1,065 individuals from the STRADL cohort were recorded using the SOMAscan® technology -with assay details as per section 4.3. There were 793 epitopes which matched between the KORA and STRADL cohorts and were included for training in KORA and testing in STRADL. Of the 160 Olink® proteins present on inflammatory and neurology panels in LBC1936, 56 matched the SOMAscan SOMAmer IDs and were used to test the EpiScores. In the STRADL cohort, there were 778 individuals with proteomics data and DNAm data available for these test set analyses. These individuals had a mean age of 60.1 ± 8.8 years. Protein measurements were transformed by rank-based inverse normalisation and regressed onto age, sex and 20 genetic principal components (derived from multidimensional scaling of genotype data from the Illumina 610-Quadv1 array).

Electronic health data linkage in Generation Scotland
Over 98% of GS participants consented to allow access to electronic health records via data linkage.
This includes GP records (Read 2 codes), prescription data, and hospital records (ICD codes). These data are available both retrospectively and prospectively from the time of initial blood draw, yielding up to ~14 years of follow-up data. For the current analysis, we considered incident disease data for 12 outcomes that are leading causes of mortality and morbidity (Supplementary Note 1). For each outcome, prevalent cases (ascertained via retrospective ICD and Read 2 codes or self-report from a baseline questionnaire) were excluded from the analyses. Self-report data was not available for the inflammatory bowel disease (IBD) outcome which meant that prevalent cases (i.e. recorded as having disease at baseline) were only excluded based on data linkage codes. A summary of the included and excluded terms for each of the 12 outcomes can be found in Supplementary Tables

14-25.
Alzheimer's dementia was limited to those cases/controls with age of event/censoring greater than or equal to 65 years. Breast cancer analyses was restricted to females only. Recurrent and both major and moderate episodes of depression were included in the depression trait, whereas single episodes of depression were excluded. The diabetes trait was comprised of type 2 diabetes and more general diabetes codes such as diabetic retinopathy and diabetes mellitus with renal manifestation.
All type 1 and juvenile cases of diabetes were excluded from the diabetes trait. An elastic net penalty was specified (alpha=0.5) and cross validation was applied. Both DNAm and protein measurements were scaled to have a mean of zero and variance of one prior to training. In the KORA analyses, 10-fold cross validation was applied. The EpiScores were then tested in STRADL (n=778). Of 480 EpiScores that generated ≥ 1 CpG features, there were 84 with Pearson r > 0.1 and P < 0.05 upon projection into Generation Scotland. As test set comparisons were not available for every protein in the LBC1936 analyses, an initial holdout sample was defined to select the most robust EpiScores for training in the full dataset. In the holdout sample, two folds were set aside as the test data and 10-fold cross validation was carried out on the remaining data (n train =576, n test =130 for neurology and n train =725, n test =150 for inflammatory proteins). The optimal predictors, based on lambda values that minimised the mean cross-validation errors, were applied to the holdout data. We retained EpiScores with Pearson r > 0.1 and P < 0.05 (n=36 EpiScores). New elastic net predictors for these 36 proteins were then generated, using 12-fold cross validation in order to maximise the sample size of the training dataset. The 36 DNAm protein EpiScores from the 12-fold cross validation were then tested externally through correlations with STRADL (n=778) and LBC1921 (n=162, for the neurology panel only) protein measurements. We identified 21 EpiScores with r > 0.1 and P < 0.05 in at least one of the external test sets. Four EpiScores did not have comparisons available and were included based on their LBC1936 holdout performance.

Elastic net protein EpiScores in LBC1936 and KORA
The 109 selected EpiScores were then applied to the DNAm dataset in Generation Scotland (n=9,537). In both the STRADL and GS cohorts DNAm at each CpG site was scaled to have a mean of zero and variance of one prior to the projections. DNAm scaling was performed separately for Set 1 and Set 2 in GS.

Associations with health linkage phenotypes in Generation Scotland
Cox proportional hazards regression models adjusting for age, sex, and methylation set were used to assess the relationship between the 109 selected DNAm protein EpiScores and 12 leading causes of morbidity and mortality in Generation Scotland. Models were run using the coxme package 78 (Version 2.2-16) in R version 4.0 with a kinship matrix specified to account for relatedness in the Set 1 methylation data. Time to first event was ascertained through data linkage. Disease cases included those who had been diagnosed after baseline and subsequently died, in addition to those who received a diagnosis and remained alive. Control subjects were censored if disease free at time of death or at the end of the data linkage follow-up period. EpiScore levels were rank-base inverse normalised prior to the analyses. Fully-adjusted models included the following additional covariates Proportional hazards assumptions were checked by running the fully-adjusted models and extracting Schoenfeld residuals (global test and a test for the protein-EpiScore variable) using the coxph and cox.zph functions from the survival package 81 (Version 3.2-3). These models did not account for relatedness. For each association failing to meet the assumption (Schoenfeld residuals P < 0.05), a sensitivity analysis was run across yearly follow-up intervals.
To understand the influence of immune cells on EpiScore-disease relationships, the fully-adjusted Cox proportional hazards models were run with Houseman-estimated white blood cell proportions as covariates 82 . A further sensitivity analyses ran the fully-adjusted models with the estimated white blood cell proportions and GrimAge acceleration score as covariates, to understand whether the protein-specific EpiScores were contributing an additional signal to that of the GrimAge score in EpiScore-disease relationships. GrimAge acceleration was obtained via Horvath's online calculator 19 .  86 . Figures 1 and 2 were created with BioRender.com.

Replication of protein -disease associations with EpiScores
Replication testing was performed for EpiScore -diabetes associations using two previous studies 30,31 . These studies identified SOMAscan plasma protein measures that associated with both the prevalence and incidence of diabetes. In both studies, two cohorts were included for replication of biomarkers (Study 1: KORA n= 993, HUNT n= 940 31 , Study 2: AGES-Reykjavik n=5,438 and QMDiab n=356 30 ). Study 1 included the KORA dataset, which the SOMAscan EpiScores we generated from in our current study. We characterised which of our SOMAscan-based EpiScorediabetes associations from the fully-adjusted mixed effects Cox proportional hazards models replicated those observed with measured protein levels from the two studies described. We included both basic and fully adjusted results (with either FDR or Bonferroni-corrected P < 0.05) wherever available, across each of the four cohorts (Supplementary Table 12).

Data availability
The informed consent given by the KORA study participants does not cover posting of participant level phenotype and genotype data in public databases. However, data are available upon request from KORA-gen (http://epi.helmholtz-muenchen.de/kora-gen). Requests are submitted online and are subject to approval by the KORA board.
Lothian Birth Cohort 1936 data are available on request from the Lothian Birth Cohort Study, University of Edinburgh (simon.cox@ed.ac.uk). Lothian Birth Cohort 1936 data are not publicly available due to them containing information that could compromise participant consent and confidentiality.
According to the terms of consent for GS and GS:STRADL participants, access to data must be reviewed by the GS Access Committee. Applications should be made to access@generationscotland.org.

Code availability
Code is available with open access at the following Gitlab repository: https://gitlab.com/dannigadd/episcores-for-protein-levels.

Ethics declarations
All KORA participants have given written informed consent and the study was approved by the Ethics Committee of the Bavarian Medical Association.    Counts are provided for the number of cases and controls for each incident trait in the basic and fully-adjusted Cox models run in the Generation Scotland cohort (n=9,537). Mean time-to-event is summarised in years for each phenotype. Alzheimer's dementia cases and controls were restricted to those older than 65 years. Breast cancer cases and controls were restricted to females.

Fig. 2: Nested Cox proportional hazards assessment of EpiScore-disease prediction
Mixed effects Cox proportional hazards analyses in Generation Scotland (n = 9,537) tested the relationship between each of the 109 selected EpiScores and the incidence of 12 leading causes of morbidity (Supplementar Tables 5-6). The basic model was adjusted for age and sex and yielded 294 associations between EpiScores an disease diagnoses, with FDR-adjusted P < 0.05. In the fully-adjusted model, which included common risk facto as additional covariates (smoking, deprivation, educational attainment, BMI and alcohol consumption) 137 of th basic model associations remained significant with P < 0.05. In a sensitivity analysis, the addition of estimate White Blood Cells (WBCs) to the fully-adjusted models led to the attenuation of 20 of the 137 associations. In further sensitivity analysis, 89 associations remained after adjustment for both immune cell proportions an GrimAge acceleration.
hips tary and tors f the ated In a and 0

Fig. 3: EpiScore associations with incident disease
EpiScore-disease associations for ten of the eleven morbidities with associations that achieved P < 0.05 in the fully-adjusted mixed effects Cox proportional hazards models in Generation Scotland (n=9,537). Hazard ratios are presented with confidence intervals for 104 of the 137 EpiScore -incident disease associations reported. Models were adjusted for age, sex and common risk factors (smoking, BMI, alcohol consumption, deprivation and educational attainment). IBD: inflammatory bowel disease. IHD: ischaemic heart disease. COPD: chronic obstructive pulmonary disease. For EpiScore -diabetes associations, see Figure 5. 1

Fig. 4: EpiScores that associated with the greatest number of morbidities
EpiScores with a minimum of three relationships with incident morbidities in the fully-adjusted Cox models. The network includes 13 EpiScores as dark blue (SOMAscan) and grey (Olink) nodes, with disease outcomes in black. EpiScore-disease associations with hazard ratios < 1 are shown as blue connections, whereas hazard ratios > 1 are shown in red. COPD: chronic obstructive pulmonary disease. IHD: ischaemic heart disease.

Fig. 5: Replication of known protein-diabetes associations with EpiScores
EpiScore -incident diabetes associations in Generation Scotland (n=9,537). The 29 SOMAscan (top panel) and four Olink (bottom panel) associations shown achieved P < 0.05 in fully-adjusted mixed effects Cox proportional hazards models. Of the 29 SOMAscan-derived EpiScores, 21 associations replicated (blue) protein -diabetes associations in one or more of the four comparison cohorts that used SOMAscan protein levels. Eight associations were novel (pink). EpiScore -incident diabetes associations that replicated both prevalent and incident proteindiabetes associations are shown as triangles, whereas associations that replicated prevalent associations are shown as squares.