DNA methylation proxies for 16 plasma proteins predict the incidence of 7 leading causes of morbidity

5 Department of Psychiatry, University of Oxford, UK 6 Department of Psychology, The University of Texas at Austin, United States 7 Population Research Center, The University of Texas at Austin, United States 8 Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD, Australia 9 Medical Research Council Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, EH4 2XU


Introduction
Ageing is associated with the increased incidence of many chronic morbidities which can raise an individual's risk of disability and mortality. The development of clinical tools for early detection and prevention of such morbidities is therefore a priority. Plasma protein biomarkers can help to achieve this goal, as their divergent expression provides insight into the likely causes and consequences of disease in an individual, oftentimes prior to the onset of clinically relevant symptoms 1 . The recent characterization of the genome-wide genetic and epigenetic architectures of plasma proteins highlights the potential for integrated omics data to uncover causal pathways contributing to protein expression 2,3 and health outcomes [2][3][4][5][6][7] .
Developing robust ways to identify the relationships that exist between genetic architectures, circulating proteins and disease onset is therefore critical to uncovering both the mechanisms of disease and ways to stratify risk.
Despite the clear utility of protein biomarkers, there are several limitations that impede current research. The use of protein expression for risk prediction and causal inference relies on large and consistent datasets. Current work is limited by both the availability of suitable samples and the variability across the many methods used to generate proteomics data 8 . Additionally, proteins involved in the acute-phase inflammatory response are known to be variable in their within-person expression, rendering single-time-point measures of these markers problematic [9][10][11] .
We propose DNA methylation (DNAm) proxies as a means to circumvent these limitations. DNAm involves the addition of a methyl group to a cytosine residue, typically in the context of a CG dinucleotide (CpG), which can regulate gene expression 12 . Genome-wide DNAm is commonly measured by the Illumina Methylation BeadChip and is widely profiled in cohort studies that do not have proteomics data available; DNAm proxies for protein expression would subsequently create new opportunities for large and consistent analyses in many existing cohorts. Moreover, as previously demonstrated for C-reactive protein (CRP) and Interleukin-6 (IL6), single-time point DNAm proxies have the strong potential to be more stable than serological measures over multiple longitudinal measurements, [13][14][15] offering a means to minimise the variability known to affect acute-phase inflammatory protein measures. We therefore hypothesise that DNAm can offer an intermediate proxy for the expression of plasma proteins to advance disease biomarker detection.
Here, we developed blood-based DNA methylation proxies for plasma protein expression and related these proxies to 12 leading causes of morbidity and mortality (Table 1, Fig. 1). A panel of 160 inflammatory and neurology proteins in a group of up to 875 individuals from the Lothian Birth Cohort 1936 were used to generate DNAm proxies through elastic net penalised regression models. This initial cohort was first split into train and holdout test sets to evaluate proxy performance. The models were then re-run using all possible individuals as the training set and proxies were applied to two separate test sets in order to select the most robust measures. The selected proxies were then projected into an independent test sample of 9,537 individuals from Generation Scotland, the largest single cohort to have DNAm information available for such predictions. Taking advantage of retrospective and prospective data linkage to primary (general practice records) and secondary care (hospital) records, we relate protein proxies to the incidence of 12 diagnoses over a follow up period of up to 14 years. Ten of the traits are recognised by the World Health Organisation as leading causes of either morbidity or mortality and include diseases such as chronic obstructive pulmonary disease (COPD), diabetes, stroke and Alzheimer's dementia 16,17 . Rheumatoid arthritis and inflammatory bowel disease were also included as they are chronic conditions which constitute substantial chronic health burdens 18,19 .
This work demonstrates that protein-by-proxy analyses can uncover early markers that may be critical to disease incidence, several years prior to diagnoses. As these diseases constitute a major global health burden and can profoundly affect the wellbeing of an individual as they age, markers that complement risk stratification and deployment of preventative interventions are essential. The availability of our predictors in a Shiny App (MethylDetectR) allows for the projection of protein expression into any cohort which has Illumina DNAm data, creating opportunities for historic and existing datasets to be probed with no requirements for additional proteomics sampling. This approach facilitates a deeper understanding of the early mechanisms associated with disease onset and identify candidate pathways for therapeutic targeting.

Creation and validation of DNAm proxies for protein expression
There were 18 neurology and 20 inflammatory proxies generated through elastic net regressions which correlated (r > 0.1, P < 0.1) with protein expression when applied to the initial holdout set in the Lothian Birth Cohort 1936 (LBC1936). Predictors for these 38 proteins were then created using the full LBC1936 cohort as the training dataset in elastic net regressions. This generated protein predictors for 37 proteins; one protein (TRAIL) contained only an intercept term (i.e. no nonzero features) and was therefore excluded.
There were between 11 and 457 features for the 37 proxies (median = 84). Validation of these proxies in two independent test sets (the Stratifying Resilience and Depression Longitudinally: STRADL subset of Generation Scotland, n=778, and the Lothian Birth Cohort 1921: LBC1921, n=162) resulted in the selection of 14 neurology and nine inflammatory proxies which performed optimally (r > 0 and P < 0.05 in at least one test cohort, Fig. 2 Tables 1-2). A further three inflammatory proxies were included without comparisons available, based on their performance in the initial holdout set (P < 0.05). Though the IL6 proxy performed poorly in the GS:STRADL test set, it has been validated against ELISA measures previously and was therefore also included 14 . Predictor weights for these 27 proxies (13 inflammatory, 14 neurology) are provided in Supplementary Table 3. Across the selected 27 proxies, there were 223 CpG probes which were included in 2 or more proxies; a summary for each CpG site with annotations to the MRC-IEU EWAS catalog 20 traits (P < 3.6 x 10 -8 ) is presented in Supplementary Table 4. The smokingrelated site cg05575921 was the most frequently selected and was present in 12 proxies. The highest proportion of cis associations -CpGs within 10Mb of the transcription start site of the gene encoding the protein -were found for MDGA1 and IL-18R1 (11% and 6%, respectively).

Proxy associations with incident diseases in Generation Scotland
There were 72 associations between the selected DNAm protein proxies and time-to-event disease incidence in the basic mixed effects Cox proportional hazards regression models with P < 0.05 after FDR correction (Supplementary Table 5). This equated to a maximum uncorrected P-value 0.01. Of the 72 associations, 26 remained significant with P < 0.05 in the fully-adjusted model accounting for age, sex and common risk factors, including alcohol consumption, body mass index, deprivation, educational attainment, and a DNAm-based proxy for smoking (Supplementary Tables 6-7). Mean attenuation of log hazard ratios was -60.5% (ranging from -163.2% to 26.5%, with 69 attenuated, two enhanced and one unchanged), across the 72 proxies and -33.4% (ranging from -78.9% to 26.5%, with 23 attenuated, two enhanced and one unchanged) across the 26 significant associations. Globally, the proportional hazards assumption was satisfied for all models. However, six of the 26 fully-adjusted associations failed the proportional hazards assumptions for the protein-proxy variable (P < 0.05 for the association between the Schoenfeld residuals and time; Supplementary Table 8). Restricting the time-to-event/censor period by possible years of followup, there were minimal differences in the proxy-disease hazard ratios between follow-up periods which did not violate the assumption and those that did (Supplementary Table 9). No associations were therefore excluded.
A hazard ratio-weighted network summary of the 26 proxy-disease relationships with P < 0.05 in the fullyadjusted model is presented in Fig. 3. A summary of the hazard ratios and confidence intervals for each proxy-disease relationship is presented in Fig. 4. These findings involved 16 unique protein proxies (nine inflammatory and seven neurology) and seven disease outcomes: depression, diabetes, ischaemic heart disease, stroke, rheumatoid arthritis, COPD and lung cancer. A one standard deviation increase in DNAm proxy CCL11 levels at baseline was associated with risk of incident depression (HR = 1.45, 95% CI = [1.19, For example, in addition to the association found for diabetes, FGF-21 proxy levels at baseline were associated with stroke (HR = 1.24, 95% CI = [1.08, 1.43], P = 0.002). Whereas higher levels of these proxies were associated with increased risk, higher levels of the SIGLEC1 proxy were linked to a reduced incidence  Table 7).
Full correlation structures for the 27 proxies included in the Cox analyses and the 16 which were associated with incident diseases are presented in Supplementary Fig. 1. There were no correlations between protein proxy measures and common risk factor covariates with r > 0.30, with the exception of the EpiSmokEr DNAm-derived measure of smoking and age (Supplementary Fig. 2). There were four diseases which were associated with multiple proxy predictors; correlation structures and principal components analyses for these proteins are presented in Supplementary Fig. 3. Filtering by an eigenvalue > 1, there were three components for the 12 COPD-associated proxies and two components across the five lung cancer proxies, suggesting that DNAm proxies were reflecting several independent proteomic signatures. One component with eigenvalue > 1 was present for both ischaemic heart disease and diabetes. A sensitivity analysis which removed controls who died after the study baseline from the Cox models did not considerably alter the 26 associations (Supplementary Table 7).

White blood cell influences on proxies
There were correlations of up to r = 0.77 between protein proxies and estimated white blood cell (WBC) proportions in the Generation Scotland cohort (Supplementary Fig. 2). A sensitivity analysis was therefore conducted, which adjusted for estimated WBC proportions in the fully-adjusted Cox proportional hazards models. In this analysis, 19 of the 26 fully-adjusted associations remained significant (Supplementary Table 10). In the 19 associations, there was an overall mean attenuation in log hazard ratios of -12.9%, ranging from -31.3% to 33.8% in relation to the fully-adjusted model, with 15 attenuated, 3 enhanced and one unchanged. In a further sensitivity analysis, relationships between estimated WBC proportions and incident diseases were assessed in the Cox model structure, independently of proxies. Four inverse relationships (higher cell proportions linked to decreased disease risk) were found between natural killer cells and the incidence of COPD, rheumatoid arthritis, diabetes and pain (Supplementary Table 11).
To explore the interplay between measured white blood cells and our proxies, the elastic net penalised regression models used in the train/test optimisation phase in LBC1936 were re-run, with measured WBCs included as features. Of the six immune cells available (Methods), neutrophil and monocyte features were selected for and increased the correlation strength between the proxy and measured proteins in the test set for seven of the nine inflammatory proxies and three of the seven neurology proxies, providing further evidence of a tightly interlinked relationship (Supplementary Table 12).

Protein Quantitative trait Locus (pQTL) mapping
To determine if pQTL mapping was possible with the proxy measures, GWAS analyses were run for the proxies corresponding to 7 proteins with GWAS significant SNPs in previous Lothian Birth Cohort 1936 analyses 2,3 (N=9 genome-wide significant SNPs; Supplementary Note 1). We replicated 7/9 sites from previous studies at P < 5 x 10 -8 , six of which were cis associations for the protein coding gene, and one of which was trans (rs46876657; Supplementary Table 13) 2,3 . Moreover, all 7 SNPs were within 75kb of a CpG that was included in the corresponding protein proxy, six of which were previously reported as methylation quantitative trait loci (mQTLs) for protein proxy CpGs [PMID 27036880] 21 . The proxies therefore largely capture mQTLs.

Discussion
By projecting DNA methylation proxies for protein expression into a large cohort with extant data linkage, we have identified 16 protein proxies that predict the incidence of seven leading causes of morbidity, after controlling for common risk factors such as smoking and deprivation.
Here, we developed DNAm proxies for protein expression to advance disease biomarker detection and offer a means to identify therapeutic targets. We were able to validate several proxies, demonstrating that DNAm can proxy for the expression of these proteins. Further to this, many of the proxy-disease relationships that we identify have been found linking true protein measurements and diseases, suggesting that our methylomic proxies capture clinically relevant facets of these pathways. Proteins corresponding to proxies associated with incident disease in our study are the targets for current therapeutic approaches; a recent trial demonstrated a reduction in the rate of decline in renal function in individuals that had type 2 diabetes and heart failure resulting from the inhibition of NEP 22,23 . NEP inhibition has also been shown to associate with improved insulin sensitivity in those with obesity and hypertension 24 , which has led to this pathway being proposed as a candidate for type 2 diabetes therapy 25 . Gene therapies targeting VEGFA to promote localised angiogenesis are also in ongoing trials for the treatment of ischaemic heart disease 26,27 and several trials are in progress for anti-SIGLEC antibody therapies in cancer due to the role of this receptor family in the modulation of tumour-associated macrophages 28,29 . Taken together, these examples suggest that our DNAm proxies are able to identify disease-relevant pathways for therapeutic targeting. Though our associations are in some cases contradictory to these therapeutic strategies, such as the inverse association found between SIGLEC1 and lung cancer, these instances may reflect a time-critical, systemic variation in protein expression during the window prior to diagnosis and their causality should therefore be explored further.
Nine of the proxies were identified as potential biomarkers for the onset of more than one disease. For example, the FGF-21 proxy measure associated with incident stroke and diabetes. Serum and plasma measures of FGF-21 have been shown to be predictive of type 2 diabetes and metabolic syndrome 30-35 , with FGF-21 implicated in the response to metformin 36 . FGF-21 has also been characterised as a blood-based marker for poor cardiovascular health, both generally and in those with type 2 diabetes 37-40 . DNAm proxies may therefore uncover important, but as yet, undefined nodes relevant to multiple diseases that are linked through the CpG features contributing to them. There were also four diseases which were associated with multiple, often intercorrelated proxies. Many of the proteins implicated in proxy-COPD relationships have been identified as potential markers for the disease and are thought to contribute to destruction of lung tissue as part of an ongoing, inflammatory state 41-44 . Principal components analysis suggests that the proxies are capturing multiple, distinct signatures of inflammatory protein expression in those that subsequently receive a COPD diagnosis. Given that anti-inflammatory therapy for COPD is highly desired but has as yet been challenging to achieve 45,46 , proxy-driven insight into the interrelatedness of the wider inflammasome may offer valuable context for therapeutic strategies.
Though the known disparities between blood and brain DNA methylation 47,48 may have limited the detection of markers relevant to neurological diseases with unique pathology in the brain, a relationship was found between CCL11 and depression. The mechanisms by which CCL11 may be related to depression are unclear; however, CCL11 is thought to mediate peripheral and central nervous system inflammation, with evidence that it has microglial and astrocytic targets 49 . As CCL11 is suggested as one of several cytokine plasma markers that may identify those with depression 50-53 , circulating DNAm proxies could prove to be relevant markers for the stratification of psychiatric illness risk.
We found that the genetic architecture of the proxy proteins largely captures either mQTLs, or pQTLs which constitute mQTLs for the CpG sites contributing to the protein proxies. The proxies are therefore unlikely to identify novel pQTL findings. Though testing in a holdout set and two external cohorts suggested that many of our 27 optimal proxies were robustly capturing protein expression, there were many proteins for which we did not achieve reasonable proxies. As the training sample increases, we expect convergence between proxy projections and measured proteins; however, our previous work indicates that there is a threshold for variance explained in protein expression by genome-wide DNAm 2,3 . Nevertheless, even where DNAm proxies CRP and IL6 correlate ~0.2 with measured protein levels, they provide a more stable measure of expression than proteomic measures when averaged longitudinally; they often outperform the measured proteins in relation to associations with health outcomes and lifestyle factors [13][14][15] . As with CRP and IL6, many of the proteins we have created proxies for are involved in or associated with the acute-phase response; GZMA is thought to promote the release of IL6, IL8 and TNF-alpha, thereby inducing a cytokine syndrome in those with sepsis (a condition hallmarked by a rapid alterations in the inflammatory state) 11,54 .
The proxies we have created for GZMA and IL8 may therefore be more stable than longitudinal serum measurements. Consequently, the protein proxy-phenotyping approach may augment insights into inflammatory pathways which can be difficult to quantify due to natural and disease-associated variability 55,56 .
This study has a number of limitations. First, the size of the protein training dataset constrained predictor generation. Second, due to missing covariate information, cases were excluded from the fully-adjusted models; however, this had a marginal influence on the main associations. Third, whereas there was a degree of attenuation in proxy-trait relationships upon adjustment for white blood cell proportions, our analyses highlight the interrelated nature of these measures with the proxies. Though many of the strongest relationships withstood adjustment for white blood cell proportions, measured white blood cells were selected as contributing features for many of the proxies in our elastic net sensitivity analyses. It is therefore challenging to establish directionality between the proxies, immune cells and diseases; however, the selection of immune cells as features contributing to proxies presents an interesting avenue for further exploration. Fourth, Cox model effect sizes should be interpreted with the caveats that hazard ratios reflect a relatively arbitrary scale (per SD of the DNAm score) and that DNAm scores were generated using relative protein measures, rather than absolute quantification. Fifth, the associations present between proxy measures and disease incidence may have been influenced by external factors such as prescription medications and disease prevalence at baseline, which should be investigated future analyses. Sixth, though we show that many of the proxies trained in an age-homogenous cohort performed well when applied to cohorts of differing age distributions, it is likely that protein measurements in our cohorts are somewhat age-sensitive.
It is therefore possible that our proxies may not generalise optimally beyond a cohort of healthy ageing individuals. Finally, our proxies were trained and tested on individuals from relatively homogeneous Scottish genetic heritage, which may limit their applicability to individuals from other genetic ancestries.
This current application is particularly valuable for cohorts such as Generation Scotland, which does not currently have protein data available but is the largest single-cohort DNAm resource in the world. We have created a Shiny app (MethylDetectR 57 ) to enable any study with Illumina-based DNA methylation data to easily generate and visualise projections for the 27 protein proxies in addition to DNAm predictors of lifestyle 58 and chronological age 59 . These proxies can be filtered by age and sex and visualised for an individual (or group of individuals e.g., disease cases) relative to the rest of the input cohort (Fig. 5).
Another strength is the extensive data linkage capacity in Generation Scotland that allowed us to investigate time-to-event for several common disease outcomes. Whereas, the number of incident cases was modest for some traits, the extant nature of the linkage means that we will continue to acquire cases across all disease areas. Our findings suggest that proxy phenotyping approaches and data linkage to electronic health records in large, population-based studies have the potential to (1) capture clinically relevant facets of true protein expression; (2) highlight novel disease-associated proteins and mechanisms, many of which have existing drug targets; and (3) augment risk prediction years prior to disease onset. This knowledge is integral to the early detection and improved risk stratification of complex diseases, which are central aims for both biomedical research and public health 60,61 .

___
In conclusion, we validate a novel framework for the large-scale identification of protein biomarkers associated with disease. Through this framework, we show that DNA methylation proxies for the expression of 16 plasma proteins predict the incidence of seven leading causes of morbidity and mortality. This work highlights the potential for methylomics approaches to uncover the drivers of multimorbidity as we age and provides context relevant to preventative interventions.

Lothian Birth Cohorts of 1936 and 1921
The Separate panels of 92 inflammatory and 92 neurology proteins (Olink® antibody-based technologymeasurement details in Supplementary Note 2) were assessed in plasma samples from the first and second waves of the LBC1936 study, respectively (mean age 69.6 years for inflammatory n=875 and 72.5 years for neurology n=706). The Olink® neurology panel was also assessed in plasma samples from wave 3 of the LBC1921 cohort in 162 individuals (mean age 86.7 years). Protein levels were rank-based inverse normal transformed and regressed on age, sex and four genetic principal components. This was performed separately for the train and test sets used in the penalised regression models. two neurology proteins, MAPT and HAGH, and twenty-one inflammatory proteins were excluded due to >40% of observations being below the lower limit of detection; one further inflammatory protein, BDNF, failed quality control and was also removed from the study. This resulted in 160 protein measurements across both panels for use.
Blood-based DNA methylation was assessed using the Illumina 450k array. Quality control details are reported in Supplementary Note 2. There were 459,308 methylation sites measured in the LBC1936. To permit comparison across platforms, sites that overlapped between the Illumina 450k and Illumina EPIC 1 1 arrays were used as the features (93% of the original sites, n=428,489) for the penalised regression models.
CpG features were scaled to have a mean of zero and variance of one prior to projections into cohorts.
White blood cell measures in the LBC1936 were acquired using the same blood samples taken for methylation and were collected and processed on the same day. Technical details for these measures have been outlined previously 65

. Monocytes, granulocytes, natural killer cells, B cells and both CD8T and CD4T
cells were available for inclusion in the elastic net sensitivity analyses. In the main GS cohort, blood-based DNA methylation has been generated in two separate sets using the Illumina EPIC array. Prior to quality control, Set 1 comprised 5,190 related individuals whereas Set 2 comprised 4,583 individuals, unrelated to each other and to those in Set 1. Quality control details have been reported previously and are also detailed in Supplementary Note 2. Briefly, probes were removed based on (i) outliers from visual inspection of the log median intensity of the methylated versus unmethylated signal per array, (ii) a bead count <3 in more than 5% of samples, and (iii) ≥ 0.5% of samples having a detection P value >0.05 in Set 1 and ≥ 1% of samples having a detection P value >0.01 in Set 2. Samples were removed (i) if there was a mismatch between their predicted sex and recorded sex and/or (ii) if ≥ 1% of CpGs had a detection P value >0.05 in Set 1 and >0.5% of CpGs had a detection P value >0.01 in Set 2. Ten saliva samples were excluded from Set 1, along with three individuals who had answered "yes" to all self-reported health conditions. One person with suspected XXY genotype and seven genetic outliers were also removed 68 . The quality-controlled dataset comprised 9,537 individuals (n Set1 =5,087, n Set2 =4,450).

Generation Scotland and STRADL
Over 98% of GS participants consented to allow access to electronic health records via data linkage. This includes GP records (Read 2 codes), prescription data, and hospital records (ICD codes). These data are available both retrospectively and prospectively from the time of initial blood draw, yielding up to approximately 14 years of follow-up data. For the current analysis, we considered incident disease data for 12 outcomes that are leading causes of mortality and morbidity (Supplementary Note 3). For each outcome, prevalent cases (ascertained via retrospective ICD and Read 2 codes or self-report from a baseline questionnaire) were excluded from the analyses. Self-report data was not available for the inflammatory bowel disease (IBD) outcome which meant that prevalent cases (i.e. recorded as having disease at baseline) were only excluded based on data linkage codes. Codes were excluded if they were not closely related to the 12 diseases and a summary of the included and excluded terms can be found in Supplementary Tables 14-25. Alzheimer's dementia was limited to those cases/controls with age of event/censoring greater than or equal to 65 years. Breast cancer analyses was restricted to females only. Recurrent and both major and moderate episodes of depression were included in the depression trait, whereas single episodes of depression were excluded. The diabetes trait was comprised of type 2 diabetes and more general diabetes codes such as diabetic retinopathy and diabetes mellitus with renal manifestation. All type 1 and juvenile cases of diabetes were excluded from the diabetes trait.

Ethics declarations
Ethical approval for the LBC1921 and LBC1936 studies was obtained from the Multi-Centre Research

Elastic Net Protein Predictors
Penalised regression models were generated for each of the 90 neurology and 70 inflammatory proteins in the Lothian Birth Cohort 1936 using Glmnet (Version 4.0-2) 69 in R (Version 3.6.0) 70 . Protein levels were the outcome and there were 428,489 CpG features per model. An elastic net penalty was specified (alpha=0.5) and cross validation was applied. To reduce the possibility of overfitting in the cross-fold validation step, each fold represented a single methylation processing batch or a combination of batches, with a range of 50-68 and 62-85 individuals per fold for the neurology and inflammatory analyses, respectively. Two folds were set aside as the test data and 10-fold cross validation was carried out on the remaining data (n train =576, n test =130 for neurology and n train =725, n test =150 for inflammatory). The optimal predictors, based on lambda values that minimised the mean cross-validation errors, were applied to the test data and we retained proxies with r > 0.1 and P < 0.1 against measured proteins in the holdout set (n=18 neurology and 20 inflammatory proxies). We generated new elastic net predictors for these 38 proteins, using 12-fold cross validation in order to maximise the sample size of the training dataset. All except one of the inflammatory folds represented a single batch. Of the 12 neurology folds, three were assigned to a singular batch and the remainder were composed of either 2-3 batches. Individuals per fold ranged from 62-85 and 49-81 for the inflammatory and neurology analyses, respectively.
Of the 38 DNAm proxies chosen from the optimisation step, 37 generated sufficient features in the elastic net regressions on the full Lothian Birth Cohort 1936. The remaining protein (TRAIL) only contained an intercept term (i.e. no nonzero features) and was therefore excluded. The 37 DNAm protein proxies from the 12-fold cross validation were then tested externally through correlations with STRADL (n=778, for both inflammatory and neurology panels) and LBC1921 (n=162, for the neurology panel) protein measurements.
Comparisons were available for all 18 neurology proxies and 14 of the 20 inflammatory proxies. We identified 14 neurology proxies and 9 inflammatory proxies with P < 0.05 and r > 0 in at least one of the external test sets. Of the 5 inflammatory proxies which had no comparison available, 3 were included based on their performance in the holdout set (P < 0.05) and two (CCL11 and TNFB) were excluded (holdout set P > 0.05). IL6 did not achieve the required thresholds but has been shown to perform well against ELISA measures previously and was therefore included 14 . The 27 chosen proxies (13 inflammatory and 14 neurology) were then applied to the DNAm dataset in Generation Scotland (n=9,537). In both the STRADL and GS cohorts DNAm at each CpG site was scaled to have a mean of zero and variance of one prior to the projections.
A sensitivity analysis was performed where we re-ran the elastic net penalised regression models used in the initial optimisation step within LBC1936 with the addition of measured white blood cell proportions as features.

Associations with health in Generation Scotland
Cox proportional hazards regression models adjusting for age, sex, and methylation set were used to assess the relationship between the 27 selected DNAm protein proxies and 12 leading causes of morbidity and mortality in Generation Scotland. Models were run using the coxme package 71 (Version 2.2-16) in R version 3.6.3 with a kinship matrix specified to account for relatedness in the Set 1 methylation data. Time 1 4 to first event was ascertained through data linkage. Disease cases included those who had been diagnosed after baseline and subsequently died, in addition to those who received a diagnosis and remained alive. Proportional hazards assumptions were checked by running the fully-adjusted models and extracting Schoenfeld residuals (global test and a test for the protein-proxy variable) using the coxph and cox.zph functions from the survival package (Version 3.2-3) 75 . These models did not account for relatedness and random effects. For each association failing to meet the assumption (Schoenfeld residuals P < 0.05), a sensitivity analysis was run across yearly follow-up intervals. To ensure that underlying health conditions which had not been diagnosed in controls who had died post-baseline were not influencing the main findings, a sensitivity analysis was run which excluded these individuals from the Cox models.
In a further sensitivity analysis, adjusted Cox proportional hazards models were re-run with Housemanestimated white blood cell proportions as covariates 77

Data availability
Lothian Birth Cohort 1936 data are available on request from the Lothian Birth Cohort Study, University of Edinburgh (simon.cox@ed.ac.uk). Lothian Birth Cohort 1936 data are not publicly available due to them containing information that could compromise participant consent and confidentiality.
According to the terms of consent for GS participants, access to data must be reviewed by the GS Access Committee. Applications should be made to access@generationscotland.org.

Code availability
All code is available at the following Gitlab repository: https://gitlab.com/marioni-group/dnam-proteinproxies. Access will be provided upon request.

)
Counts are provided for the number of cases and controls for each incident trait in the Generation Scotland cohort (n=9,537). Mean time-to-event is summarised in years for each phenotype.

Fig. 1: A new framework for protein-by-proxy biomarker analyses.
Elastic net penalized regression models were used to create DNAm proxies for protein expression. Aft evaluation in a holdout set, the optimal predictors (r > 0.1, P < 0.1) were trained on the full LBC1936 cohort an tested in a further two external test sets: GS:STRADL (n=778) and LBC1921 (n=162). Selected proxies we projected into Generation Scotland (N=9,537) and related to the incidence of 12 leading causes of morbidity an mortality, through GP and hospital electronic health data linkage over a period of up to 14 years. Proxy-diseas associations with FDR-adjusted P < 0.05 in the basic and P < 0.05 in the fully-adjusted Cox mixed effec proportional hazards models were identified as significant. All protein measurements were rank-based invers normal transformed and regressed on age, sex and four genetic principal components. All CpGs were scaled have a mean of 0 and standard deviation of 1. Image created with BioRender.com.
fter and were and ease fects erse d to a, Results from the evaluation of DNAm proxies for protein expression in the LBC1936 cohort for Olink neurology (train n=576, test n=130) and inflammatory (train n=725, test n=150) protein panels. Of the 16 proteins used in the proxy generation step, there were 38 proxies with r > 0.1 and P < 0.1 in the initial holdout s which were re-run using the full LBC1936 data as the training set. Of these 38, the inflammatory (b) an neurology (c) proxies shown performed well (P < 0.05) in either one or both of the subsequent GS:STRAD (n=778) and LBC1921 (n=162) test sets, wherever protein data was available for comparison. IL6 was previous validated against ELISA measurements and was therefore included despite poor performance. The correlatio coefficients are plotted here for the final 27 proxies (13 inflammatory and 14 neurology) with upper and low confidence intervals. Proxies are ordered by performance in the initial LBC1936 holdout se nk® 160 set and DL usly tion wer set.

Fig. 3: DNAm proxy associations with incident disease.
DNAm proxy measures for inflammatory (green) and neurology (blue) panel proteins which predicted incide diseases (white) in the Generation Scotland cohort (N=9,537). All 26 relationships between proxies and disease presented here were significant with FDR-corrected P < 0.05 in the basic Cox mixed effects proportional hazar model and were also significant with P < 0.05 in the fully-adjusted model after accounting for age, sex an common risk factors. The connecting edges of the network represent proxy-disease relationships and the thickness is weighted according to log hazard ratios. Positive associations indicating increased risk of disease a shown as black edges. Protective, negative associations are shown as grey edges. IHD: ischaemic heart diseas RA: rheumatoid arthritis. COPD: chronic obstructive pulmonary diseas dent ases zard and their are ase. ase.

Fig. 4: Hazard ratios for DNAm proxy-disease relationships.
Hazard ratios for the Olink® inflammatory and neurology protein proxies (per SD increase) related to health outcomes in the Generation Scotland cohort (n=9,537). The 26 proxy-disease associations which were significant with FDR-corrected P < 0.05 in the basic Cox mixed effects proportional hazards model and significant with P < 0.05 in the fully-adjusted model after accounting for age, sex and common risk factors are presented with confidence intervals. IHD: ischaemic heart disease. RA: rheumatoid arthritis. COPD: chronic obstructive pulmonary disease.   In this panel, the distributions of estimated FGF-21 levels are shown for individuals with incide diabetes (cases; blue) and for those who remained free of the disease (controls; pink) in the present study. Th score for a selected individual is illustrated by the dotted vertical line. The user can subset the sample by ag range and sex. (b) In this panel, the user can choose to view percentile ranks for a selected individual whe compared to the remainder of the sample for up to 10 traits simultaneously. Alternatively, as shown here, the us can select an option to show the median percentile for cases in their sample with respect to a binary trait o interest that is uploaded by the user. Interquartile ranges for case percentile ranks are shown by horizontal bars. I this example, the median percentile for diabetes cases are plotted for a number of physical traits and the thre protein proxies which were associated with incident diabetes in a fully-adjusted Cox model: FGF-21, N-CDas and NEP.

ion.
ased ls in dent The age hen user it of s. In hree ase