## Abstract

DNA methylation is a stable epigenetic alteration that plays a key role in cellular differentiation and gene regulation, and that has been proposed to mediate environmental effects on disease risk. Epigenome-wide association studies have identified and replicated associations between methylation sites and several disease conditions, which could serve as biomarkers in predictive medicine and forensics. Nevertheless, heterogeneity in cellular proportions between the compared groups could complicate interpretation. Reference-based cell-type deconvolution methods have proven useful in correcting epigenomic studies for cellular heterogeneity, but they rely on reference libraries of sorted cells and only predict a limited number of cell populations. Here we leverage >850,000 methylation sites included in the MethylationEPIC array and use elastic net regularized and stability selected regression models to predict the circulating levels of 70 blood cell subsets, measured by standardized flow cytometry in 962 healthy donors of western European descent. We show that our predictions, based on a hundred methylation sites or fewer, are less error-prone than those of existing methods, and extend the number of cell types that can be accurately predicted. Application of the same methods to age, smoking consumption and several serological responses to pathogen antigens also provides accurate estimations. Together, our study substantially improves predictions of blood cell composition based on methylation profiles, which will be critical in the emerging field of medical epigenomics.

## Introduction

Cellular subtypes that compose organisms derive from various differentiation lineages during development. As stem cells differentiate into more specialized cells, their genome accumulates epigenetic modifications, *i.e.*, stable chemical additions to the DNA that can affect gene expression but do not change the DNA sequence, resulting in cell-specific gene expression. DNA methylation (DNAm), a stable epigenetic mark that refers to the attachment of a methyl group to DNA cytosine, plays a key role in cellular differentiation and gene regulation. Epigenome-wide association studies (EWAS) have searched for DNAm sites that covary with disease conditions or disease-related traits, as these DNAm changes could mediate the effects of environmental perturbations on the transcriptional reprogramming of differentiated cells and, in turn, organismal phenotypes (1, 2). However, interpretation of the results can be problematic, because statistical associations between DNAm and a condition of interest could be due to either a perturbation of the epigenetic properties of a cell subtype that causes the condition, or heterogeneity in the proportions of differentiated cells caused by the condition (3, 4). For example, because rheumatoid arthritis triggers a change in the granulocyte-to-lymphocyte ratio, an EWAS of this disease identified thousands of associated DNAm sites that became non-significant upon correction for cellular heterogeneity (5). Thus, there is a clear need in the epigenomics field for methods that reliably enumerate cell sub-populations from heterogeneous tissues (6).

Currently, the gold standard approach for cell counting is flow cytometry, a laser-based technology that simultaneously detects several fluorescent-labelled protein markers at a single-cell resolution. However, this approach is labour-intensive and costly, requires skilled practitioners, and its performance is affected by sample degradation. Alternatively, cell composition can be indirectly estimated from gene expression profiles, which are known to be cell-specific (7, 8). These methods, referred to as cellular deconvolution, rely on transcriptional profiles of reference cell populations to predict the cellular composition of sampled cell mixtures; however, transcriptional profiles are also strongly affected by degradation and are difficult to standardize. In a seminal study, Houseman and colleagues used projection methods similar to the ones used for gene expression to estimate blood cell mixture proportions from DNAm profiles (9), a more stable molecular measure. Because DNAm changes are thought to be involved directly in the lineage decision of hematopoietic cells (10, 11), they provide a direct link with blood cell identity. This method, referred to as the ‘Houseman method’ or ‘Houseman model’, uses DNAm profiles from six sorted cell subtypes as a reference, and assumes that the heterogeneous sample of interest is a mixture of these cells, whose proportions are estimated by projecting the sample matrix onto the reference matrix. The method can estimate the proportion of six major immune cells in blood, using a reference library of 600 CpG sites. Koestler *et al.* proposed a refined reference library, called IDOL (12), achieving better estimation of the six subsets with only 300 CpG sites. Although these methods have been extensively used, they only estimate six major cell subsets, and need at least 300 probes, limiting their usefulness as a tool for adequately controlling confounding in EWAS, and for applications in clinical research.

Here, we build novel parsimonious models for predicting the circulating levels of 70 blood cell subsets measured by flow cytometry in 962 healthy donors of the Milieu Intérieur study (13, 14), based on blood DNAm levels at >850,000 sites (Illumina MethylationEPIC BeadChip (15)). The models are based on two key assumptions: 1) methylation at some sites marks differentiation events that can identify a particular blood cell lineage, and 2) only a few methylation probes on the EPIC array mark such differentiation events. The first assumption implies a linear relation between the cell proportion in whole blood and methylation levels at the sites that mark it. We therefore use linear regression models to predict blood cell composition from DNAm levels. The second assumption means that only a small fraction of the probes will actually be predictive. We must therefore look for *sparse* models, which discard many of the included predictors in a data-driven fashion.

We use two approaches to build predictive models of immune cell proportions. The two assumptions mentioned above lead naturally to regularized linear regression models. Therefore, to infer optimal models in terms of prediction accuracy, and to investigate how prediction accuracy depends on the number of predictors, we use the elastic net method (16, 17). Similar models have previously been used for the prediction of age, smoking status, alcohol consumption and educational attainment based on DNAm (18, 19). We believe that elastic net regression will be able to find both the predictors that mark differentiation events for the lineage of the cell and the numerous probes that are correlated with such predictors. In addition, we use a more stringent selection technique, *stability selection* (20, 21), to find a minimal stable set of predictors for each proportion. Stability selection selects predictors of each immune cell proportion that are consistently predictive in 100 subsamples of the dataset. We then build predictive models from the stability selected set of predictors using ordinary least squares. Compared with the elastic net, stability selection is more demanding of the predictors it selects. Consequently, it targets probes that mark differentiation events that are the most important for the cell. We therefore explore the biological functions associated with the stability selected probes, to improve knowledge of the epigenetic changes that characterize differentiated immune cells. A similar two-pronged approach is used to predict other conditions and traits collected within the Milieu Intérieur study, including age, smoking, height, BMI, routine chemical and hematological laboratory tests, and the serological responses to antigens of 13 common pathogens (22). Several of these traits have not previously been modelled using all DNAm probes jointly.
Our study substantially improves predictions of blood cell composition based on DNA methylation profiles, which will be critical for applications in medical epigenomics, forensics and disease prognosis.

## Results

### Optimization of predictive models

To predict immune cell proportions with optimal accuracy, given our assumption of sparsity and linearity, we use elastic net regularization. It is controlled by two regularization parameters: *λ*, which controls the overall strength of the regularization, and *α*, which balances the penalty that enforces sparsity on the coefficients against the penalty that restricts their magnitude. We use 5 different values for *α* and 200 different values for *λ*. Each possible pair of *α* and *λ* parameter values gives a different number of predictors and amount of regularization, and is a step in the so-called *regularization path*. We measure the prediction accuracy along the regularization path by the mean absolute error (MAE) and the correlation (*R*) between the hold-out sample values and the out-of-sample predictions in 10-fold, twice repeated two-dimensional cross-validation, described in Algorithm 1. The procedure gives 20 samples from the distribution of out-of-sample prediction accuracy along the regularization path. We use those samples to estimate the mean accuracy and its 95% confidence interval.
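The evaluation loop described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not a reproduction of Algorithm 1: the toy data, the helper names, the ordinary least squares placeholder model and the percentile-based interval are all assumptions made for the example. In the study, each training fold would instead fit an elastic net path over the (*α*, *λ*) grid.

```python
import numpy as np

def repeated_kfold(n_samples, n_folds=10, n_repeats=2, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        for fold in np.array_split(perm, n_folds):
            yield np.setdiff1d(perm, fold), np.sort(fold)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Toy data standing in for methylation levels (X) and one cell proportion (y).
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 5))
y = X @ np.array([0.5, -0.3, 0.0, 0.0, 0.2]) + rng.normal(scale=0.05, size=200)

scores = []
for train, test in repeated_kfold(len(y)):
    # Placeholder model (assumption): least squares with an intercept.
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
    scores.append((mae(y[test], pred), pearson_r(y[test], pred)))

scores = np.array(scores)                 # 10 folds x 2 repeats = 20 rows
mean_mae, mean_r = scores.mean(axis=0)
# Empirical 95% interval across the 20 out-of-sample estimates (assumption).
ci_mae = np.percentile(scores[:, 0], [2.5, 97.5])
ci_r = np.percentile(scores[:, 1], [2.5, 97.5])
```

Each of the 20 rows of `scores` is one out-of-sample estimate, which is why ten folds repeated twice yield 20 samples of the accuracy distribution.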

The performance of models, together with the number of predictors that is optimal in terms of prediction accuracy, is shown in Table 1 for each cell proportion, as well as 23 other continuous traits, including age and morphometric and physiological measures. DNA methylation levels can accurately predict age and sex (18), intrinsic factors that are predictive of many traits, including immune cell counts in whole blood (14). It is therefore important to discern when predictors based on methylation probes give additional information to these two commonly available factors. For comparison, we therefore include in Table 1 the prediction accuracy of a linear model that only includes age and sex as predictors. We also build predictive models for binary phenotypes, including smoking status and serostatus for 13 different common infections, using elastic net regularization together with the cost function of the binomial likelihood with a logit link function. Similarly to the approach we use for the continuous traits, we estimate prediction accuracy in terms of model complexity using cross-validation. For binary traits we measure prediction accuracy by the classification rate, *i.e.*, the proportion of correct class predictions (the probability threshold is taken at 0.5). Prediction accuracy for models with the optimal number of predictors for the binary traits is shown in Table 2.
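As a small illustration of the accuracy measure for binary traits, the following Python sketch computes the classification rate at a 0.5 threshold and compares it with a naive rule that always predicts the most prevalent class. The function names and toy values are hypothetical, and assigning a probability of exactly 0.5 to class 1 is an assumption of this sketch.

```python
import numpy as np

def classification_rate(y_true, prob, threshold=0.5):
    """Proportion of correct class predictions; prob >= threshold maps to class 1."""
    y_pred = (np.asarray(prob) >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

def naive_rate(y_true):
    """Accuracy of always predicting the most prevalent class."""
    p = float(np.mean(y_true))
    return max(p, 1.0 - p)

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.7, 0.55])

cr = classification_rate(y_true, prob)   # 6 of 8 correct -> 0.75
baseline = naive_rate(y_true)            # 5 of 8 are class 1 -> 0.625
```

A model is only informative for a binary trait when its classification rate exceeds this prevalence-based baseline.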

### Blood cell deconvolution

Accurate estimations are obtained with elastic net regularized models for 35 immune cell proportions (estimated correlation between predicted and observed out-of-sample values *R*>0.6; Table 1). The four immune cell proportions that we predict with the highest accuracy are CD8^{+} naive T cells, with a correlation between predicted and observed out-of-sample values of *R*=0.92 (95% CI: [0.87, 0.96]), using 312 predictors (95% CI: [295, 338]); B cells (*R*=0.90, 95% CI: [0.8, 0.96]) using 606 predictors (95% CI: [582, 635]); CD8^{+} T cells (*R*=0.90, 95% CI: [0.84, 0.94]) using 555 predictors (95% CI: [526, 591]); and natural killer (NK) cells (*R*=0.88, 95% CI: [0.71, 0.95]) using 1072 predictors (95% CI: [1036, 1126]). For most immune cell proportions, methylation levels clearly provide additional information in comparison to just age and sex.

A comparison of the performance of our elastic net models and the Houseman model, using either the standard or IDOL reference libraries (9, 12), is given in Table 3. Our models outperform the two models for the six major cell-types that they are currently able to estimate. The correlations between predicted and observed out-of-sample values are systematically higher for our models, relative to the Houseman model with either the default or IDOL reference library (Table 3). Furthermore, our models are less error-prone (Table 3). These findings suggest that elastic net regression models, trained on whole blood standardized cytometry data, can outperform constrained projection techniques based on reference values obtained in a limited number of isolated blood cell sub-types.

### Linear models selected by stability selection

We next evaluate how prediction accuracy varies with the number of predictors in our models. The regularization paths for the nine best predicted traits are shown in Figure 1. Interestingly, out-of-sample prediction error decreases rapidly with the number of predictors, and plateaus at around 50 predictors (Figure 1). This indicates that accurate predictions can be achieved with far fewer predictors than the hundreds of DNAm probes used by current prediction models of cell composition (9, 12) and age (18, 23). These results suggest that blood cell composition can be predicted well using only a small number of probes that are markers for differentiation events. To find such probes, we estimate a minimal robust predictor set using stability selection. We select and build the models on a subsample of 866 randomly selected individuals, and then evaluate them on a hold-out sample of 96 randomly selected individuals.

The predictive accuracy of the stability-selected predictive models is high (Table 1) and comparable to that of elastic net regression models, while using considerably fewer predictors. Prediction performance is also apparent when predicted out-of-sample values are plotted against the observed values for the 16 most accurate models (Figure 2). For instance, using only six methylation probes, the correlation between estimated and observed values for T cells is *R*=0.77 and the MAE is lower than 3%. We verify that our stability selected models are competitive by comparing their prediction accuracy to that of the Houseman model using either the standard or IDOL reference panels. Although our models use only 15, 12, 13, 13, 14 and 3 predictors for B cells, CD4^{+} T cells, CD8^{+} T cells, monocytes, NK cells and neutrophils, respectively, they yield comparable out-of-sample correlations and lower MAE (Table 4), relative to current methods. Together, these results demonstrate that prediction models that use a dozen or fewer methylation probes selected by stability selection can achieve prediction accuracy comparable to that of gold-standard, reference-based cell deconvolution techniques that use hundreds of probes.

### Biological relevance of the stability selected methylation probes

Because blood cell proportions could be accurately predicted with just a dozen DNAm probes, we next investigate the relevance of the stability-selected probes to cell biology. We find several methylome-wide significant DNAm probes located close to, or within, genes with well-known functions in immune cell differentiation (Table 5). For instance, DNAm levels within *CD4*, *CD8A* and *CD8B* genes are associated with the CD4:CD8 ratio (*P*=3.9×10^{-11}), the proportion of CD8a^{+} NK cells (*P*=4.6×10^{-17}) and the proportion of CD8b^{+} T cells (*P*=1.6×10^{-9}), respectively. The proportion of neutrophils is associated with DNAm levels in the *PDE4B* gene body (*P*=1.8×10^{-8}), which plays a key role in neutrophil function (24). Similarly, the proportion of MAIT cells is associated with DNAm levels in the 5′ UTR of *IL21R* (*P*=8.3×10^{-21}), which is known to regulate MAIT cell numbers (25). Several cell sub-types, including leukocytes, lymphocytes, monocytes and ILC, are associated with DNAm sites within *AHRR*, *F2RL3* and *GATA3* genes, which are known to be strongly affected by cigarette consumption (26–28). Consistently, we recently showed that circulating levels of these different blood cell subsets are significantly impacted by smoking status (14). Finally, a number of the selected DNAm probes have previously been associated with disease (Table 5). For instance, DNAm within the *ACSF3* gene is associated with the proportion of naive B cells (*P*=4.1×10^{-9}) and has been shown to be differentially methylated in B cells of patients with rheumatoid arthritis (29), suggesting that B cell subtype fractions are altered in these patients. Together, these findings support stability selection as a robust tool to select relevant associated variables, and illustrate the biological relevance of DNAm probes selected as predictors of immune cell proportions.

### Prediction of other factors

Among the other quantitative factors assessed in the Milieu Intérieur cohort, prediction by the elastic net method is the most accurate for age (Table 1). Using 701 predictors (95% CI: [673, 749]), we estimate age with an MAE of 1.67 years (95% CI: [1.5, 1.88]), confirming that it can be estimated from DNAm with high accuracy (18, 23). From Table 1 it appears that our elastic net models are also able to estimate red blood cell counts, height and weight with high accuracy. However, a comparison with the model that only uses age and sex reveals that the predictive power of methylation levels for these two traits probably stems mostly from their ability to predict age and sex.

We next evaluate the accuracy of elastic net models to predict, based on DNAm data, smoking status and the serostatus for 13 common infections, including infections by *Toxoplasma gondii*, *Helicobacter pylori*, cytomegalovirus (CMV), Epstein-Barr virus (EBV), hepatitis B virus (HBV), Herpes Simplex virus (HSV), Varicella Zoster virus (VZV), mumps virus and measles virus (22). Binary traits for which the prediction of the most prevalent condition outperforms the naive prediction are shown in Table 2. We obtain good prediction results for smoking consumption and CMV serostatus, which is natural considering that both factors have been shown to broadly affect immune cell variation (14). The out-of-sample classifications for both of these traits are correct almost 90% of the time. The estimated regularization paths for the different binary traits are shown in Figure 3, which indicates that near optimal prediction can be achieved with fewer than 50 DNAm probes. The optimal classification rate for CMV serostatus is *CR*=87% (95% CI: [81%, 94%]) using 256 predictors, while for smoking, *CR*=89% (95% CI: [82%, 94%]) using 193 predictors.

We also select a robust minimal set of predictors using stability selection for the binary phenotypes. Models are selected and fitted using the same training set of 866 samples as for continuous traits, and then evaluated on the 96 hold-out samples. The prediction accuracy of the models is shown in Table 2. Interestingly, the stability selected model for CMV performs slightly better than the elastic net model, using only 13 probes, while the selected model for smoking performs notably worse. This indicates that the relationship between DNAm and smoking is less sparse than that between DNAm and CMV serostatus. Methylome-wide significant probes selected for smoking are well-known DNAm sites predictive of cigarette consumption (Table 6). We find that HBV, *T. gondii* and HSV-1 infections associate with DNAm sites close to the *ELOVL2* and *KLF14* genes, known to be strongly associated with age. This suggests no effects of these infections on DNAm besides that of age, with which they are themselves strongly correlated (22). More interestingly, DNAm associated with *H. pylori* seropositivity is found within the poliovirus receptor-like 3 gene (*P*=4.6×10^{-12}), an intestinal epithelium receptor for bacterial toxins (30), suggesting a role of this protein in *H. pylori* infection.

## Discussion

Our study reports novel, accurate models to predict blood cell composition from whole blood DNAm profiles. Models were built using a unique dataset that comprises both the quantification of 70 blood cell proportions by standardized flow cytometry (14) and blood methylomes established with the MethylationEPIC array (15), assessed in 962 healthy donors of western European ancestry. Predictive models are built using the elastic net method (17), a regularized linear regression model that has been recently used to predict age from MethylationEPIC array data (18). The prediction accuracy, measured as the correlation between predicted and observed out-of-sample values and the MAE, is improved for our models, compared to the widely-used Houseman model, based on either the standard or improved IDOL reference libraries (9, 12). We are also able to accurately predict 35 subset frequencies, in contrast to the six that are currently possible to estimate by the Houseman model using either reference panel. These results suggest that our models should better prevent false positives in EWAS due to cellular heterogeneity, relative to existing gold-standard methods. Nevertheless, it must be noted that we assessed prediction accuracy based on cellular fractions estimated with the same flow cytometry technique, panel design and standardization steps as those used for the training dataset, which may disfavor the other methods trained on other types of cell enumeration techniques.

We also show that it is possible to find predictive models of immune cell proportions that are comparable in terms of accuracy to elastic net models, and to the Houseman models with either reference library, using considerably fewer predictors. This is done by employing the stability selection technique (20, 21). Because of their much smaller size, such models can more robustly, flexibly and cost-effectively predict blood cell composition, age, and smoking consumption than previous models.

Thanks to the exhaustive immunophenotyping performed in our training dataset, we can extend the number of blood cell subsets that can be accurately predicted from blood DNAm data. Notably, our models can accurately predict the blood frequencies of MAIT cells, eosinophils, basophils and T_{reg} cells (*R*>0.6; Table 1). Importantly, all these leukocyte subsets have previously been reported to vary with various disease conditions, and are thus expected to confound interpretation of EWAS. For instance, circulating levels of MAIT cells are known to be strongly altered during infection (31) and in systemic lupus erythematosus and rheumatoid arthritis patients (32). Eosinophil numbers change with exposure to allergens and in asthmatic patients (33). Similarly, T_{reg} populations and sub-populations show altered frequencies in several autoimmune and allergic diseases (34). Therefore, adjusting for these newly-predicted cell populations may improve correction for cellular heterogeneity in epigenomic studies of immune-related disorders. More generally, we envisage that prediction models of blood cellular composition could also be employed to better understand disease pathophysiology *per se*. While EWAS assume that disease-associated DNAm sites affect the transcriptional reprogramming of already differentiated cells, there is increasing evidence that diseases can also be caused by stable alterations of cellular repertoires, a phenomenon recently referred to as polycreodism (4). We suggest that model-based estimation of blood cell composition in large longitudinal cohorts, for which methylomes but no flow cytometric measurements exist, will represent a powerful new approach to evaluate whether perturbations in cell proportions can predict disease outcome.

## Methods

### DNA methylation data

The *Milieu Intérieur* cohort includes 1,000 healthy donors who were recruited by BioTrial (Rennes, France) and were stratified by gender (*i.e.*, 500 women and 500 men) and age (*i.e.*, 200 individuals from each decade of life, between 20 and 70 years of age). Donors were selected based on stringent inclusion and exclusion criteria, detailed elsewhere (13). DNAm data were retrieved for all donors from a previous study (15), where detailed methods are provided. In brief, the DNA methylome was profiled with the Infinium MethylationEPIC BeadChip on whole blood-derived samples. Raw fluorescence intensities of 866,895 methylation sites across the human genome were processed with the *R* (version 3.5) *Bioconductor* package *minfi*. Values were corrected for probe color bias and differences in type-I and type-II probe distributions, using the single sample NOOB (ssNOOB) method implemented in *minfi*. Because we wanted to use the methylation data primarily for prediction, which can easily be evaluated on out-of-sample observations and in validation cohorts, we wanted to exclude as few probes as possible. Therefore, we did not exclude probes from the X and Y chromosomes. Nor did we exclude possibly cross-reactive probes. From the 866,895 initial probes, we only excluded probes that had a *detection P* ≥ 0.01 for *more* than 3 samples. A total of 858,923 probes were kept for the analyses. We suppose in this study that DNAm levels are linearly related to cell proportions. We therefore use β methylation values instead of *M* values.
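The probe filter and the choice of β over *M* values can be illustrated with a short Python sketch. It assumes a detection *P* matrix laid out as probes (rows) by samples (columns); the function names and toy values are hypothetical and the code is not the preprocessing pipeline used in the study.

```python
import numpy as np

def keep_probes(det_p, p_cutoff=0.01, max_failed=3):
    """Return a boolean mask over probes (rows of det_p).

    A probe is kept unless detection P >= p_cutoff in MORE than
    max_failed samples.
    """
    n_failed = (det_p >= p_cutoff).sum(axis=1)
    return n_failed <= max_failed

def beta_to_m(beta, eps=1e-6):
    """Logit (base 2) transform of beta values into M values."""
    beta = np.clip(beta, eps, 1.0 - eps)   # avoid division by zero and log(0)
    return np.log2(beta / (1.0 - beta))

# Toy detection P matrix: 3 probes x 5 samples.
det_p = np.array([
    [0.001, 0.001, 0.001, 0.001, 0.001],   # never fails         -> kept
    [0.05,  0.05,  0.05,  0.001, 0.001],   # fails in 3 samples  -> kept
    [0.05,  0.05,  0.05,  0.05,  0.001],   # fails in 4 samples  -> dropped
])
mask = keep_probes(det_p)

# A 50:50 mixture of cells with beta 0.1 and 0.5 has beta 0.3, but the
# average of the two M values differs from the M value of the mixture.
m_of_mixture = beta_to_m(np.array([0.3]))[0]
mean_of_m = beta_to_m(np.array([0.1, 0.5])).mean()
```

Because the *M* value is a nonlinear transform of β, averages over cell mixtures are linear only on the β scale, which is consistent with the linearity assumption stated above.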

### Flow cytometry data

Flow cytometry data were retrieved for all *Milieu Intérieur* donors from a previous study (14), where detailed methods are provided. Briefly, whole blood samples were collected from the 1,000 healthy, fasting donors on Li-heparin. Sample staining was performed within 6h of blood draw. Ten 8-color flow cytometry panels were developed. The acquisition of cells was performed using two MACSQuant analyzers, which were calibrated using MACSQuant calibration beads. Flow cytometry data were generated using MACSQuantify™ software. Among the 313 exported immunophenotypes, we only kept 70 cell proportions and 2 ratios as candidate measures for prediction.

### Houseman model using standard and IDOL reference libraries

We used the implementation of the Houseman model in the *estimateCellCounts2* function of the *Bioconductor R* package *FlowSorted.Blood.EPIC* to predict immune cell proportions for all our 962 samples with both the default and IDOL reference panels.
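For intuition about reference-based deconvolution, the constrained projection of a mixed sample onto a reference library can be sketched in Python. This is a simplified stand-in and not the routine implemented in the package: it solves a non-negative least squares problem by projected gradient descent, omits any constraint on the sum of the proportions, and uses a hypothetical reference matrix and function name.

```python
import numpy as np

def project_onto_reference(b, R, n_iter=500):
    """Solve min_w ||b - R w||^2 subject to w >= 0 by projected gradient descent.

    b : methylation levels of the mixed sample at the reference CpG sites
    R : reference matrix, CpG sites (rows) x sorted cell types (columns)
    """
    step = 1.0 / np.linalg.norm(R, 2) ** 2   # 1 / largest eigenvalue of R^T R
    w = np.zeros(R.shape[1])
    for _ in range(n_iter):
        grad = R.T @ (R @ w - b)             # gradient of 0.5 * ||b - R w||^2
        w = np.maximum(w - step * grad, 0.0) # gradient step, then project on w >= 0
    return w

# Hypothetical reference: 6 CpG sites x 3 cell types, two marker sites per type.
R = np.array([
    [0.9, 0.1, 0.1],
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
])
w_true = np.array([0.6, 0.3, 0.1])   # true proportions of the three cell types
b = R @ w_true                        # noise-free whole-blood methylation profile
w_hat = project_onto_reference(b, R)
```

In this noise-free toy case the projection recovers the mixing proportions; with real data, accuracy depends on how well the reference profiles match the cells in the sample.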

### Statistical modeling

We suppose that there are DNAm CpG sites in the genome of a cell that mark a particular cellular lineage, in the sense that the methylation state of these sites is specific to the cells belonging to that lineage. Therefore, we expect the state of methylation at a number of CpG sites to mark the identity of a particular blood cell. In whole blood, the percentage of cells that are methylated at such DNAm sites should be linearly related to the proportion of that cell type in the blood. We further suppose that it is primarily such DNAm sites, and sites related to them, that are predictive of blood cell proportions. We therefore use a linear model to predict blood cell proportions from DNA methylation levels in whole blood. Let *S*^{P}, with *P* = 858,923, denote our observations of the percentage of methylation at all measured DNAm sites. Let *C* ≪ *P* be the number of sites that are related to a differentiation event that offers information on the identity of a particular blood cell. This could be a primary event that directly determines cell identity, or it could be an event that gives information on the identity of the cell because of the correlation structure with other cells or genetic and environmental factors. We expect that only a few sites correspond to primary events and we further expect the average methylation at such sites to be highly predictive of the immune cell proportion whose lineage it marks. We expect more events that offer correlational information on the proportion of immune cells. Typically, such sites are distributed according to a long tail of decreasing predictive power. Let *D* ≪ *C* be the number of sites that correspond to primary differentiation events for a particular blood cell. To summarize: for a particular blood cell, we are targeting two sets of probes, *S*^{C} and *S*^{D}, such that

$$S^{D} \subset S^{C} \subset S^{P},$$

and we suppose a predominantly linear relationship between these variables and the cell proportion. We are therefore looking for sparse linear models, where the coefficients of the predictors in *S*^{P} \ *S*^{C} are set to zero. We employ two different strategies to target the predictors in *S*^{C} and *S*^{D}. Let *n* = 962 be our sample size. We expect *D* ≪ *n*, but do not necessarily suppose that *C* < *n*. For *S*^{C} we therefore need to *select* predictors, but the linear regression equation system could still be underdetermined, so we also need to *regularize* the coefficients of the fitted linear model. To do this, we employ elastic net regularization (17). In the case of *S*^{D}, we only want to *select* predictors and then fit an unbiased least squares regression model. We achieve this by using the stability selection technique (20, 21).
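The linearity assumption can be made explicit with a simple mixture argument. The symbols below are introduced only for this illustration: *m*_{j} is the whole-blood methylation level at site *j*, *w*_{k} is the proportion of cell type *k* among *K* cell types, and *μ*_{jk} is the methylation level of site *j* in cell type *k*. If whole blood is a mixture of cell types, then

$$m_{j} = \sum_{k=1}^{K} w_{k}\, \mu_{jk}, \qquad \sum_{k=1}^{K} w_{k} = 1 .$$

For an idealized marker site of lineage *k*^{*}, methylated at level *μ*_{1} in that lineage and at level *μ*_{0} in all other cells, this reduces to

$$m_{j} = w_{k^{*}} \mu_{1} + (1 - w_{k^{*}})\, \mu_{0} = \mu_{0} + (\mu_{1} - \mu_{0})\, w_{k^{*}},$$

so the whole-blood methylation level at such a site is a linear function of the proportion of the cell type it marks.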

### Elastic net regression

Let now *X* ∈ ℝ^{n×P} be the matrix corresponding to *S*^{P}, with the methylation percentages as columns. Furthermore, let *y* ∈ ℝ^{n} be observations of a cell proportion. The elastic net cost function combines the least squares term with two regularization terms on the magnitude of the coefficients for the columns in *X*. In the parametrization used by *glmnet*, with the intercept omitted for brevity, the estimator is

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^{P}}{\arg\min} \; \frac{1}{2n} \lVert y - X\beta \rVert_{2}^{2} + \lambda \left( \frac{1-\alpha}{2} \lVert \beta \rVert_{2}^{2} + \alpha \lVert \beta \rVert_{1} \right). \tag{1}$$

The parameter *α* chooses between a pure squared Euclidean norm penalty, ∥*β*∥_{2}^{2}, corresponding to ridge regression at *α* = 0, and a pure ℓ_{1} norm penalty, ∥*β*∥_{1}, corresponding to the LASSO (16) at *α* = 1. If *α* ≠ 0, the estimator in Eq. (1) performs a selection: coefficients that do not rise above a noise floor are set to exactly zero. The pure LASSO penalty has a *saturation* property: it cannot select more predictors than the number of samples (35). Note that for the pure LASSO penalty, all coefficients will be zero if

$$\lambda \geq \frac{1}{n} \lVert X^{\top} y \rVert_{\infty}.$$

To target *X*^{C}, we suppose that an *α* between zero and one will be optimal. To find this parameter, we employ our own cross-validation scheme, detailed in Algorithm 1. We solve the optimization problem in Eq. (1) with the *glmnet* package in *R*.
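To make the role of the two penalties concrete, the following Python sketch minimizes an elastic net objective by cyclic coordinate descent. It is an illustrative re-implementation under stated assumptions, not the *glmnet* code: the objective is written in *glmnet*'s parametrization without an intercept, the data are simulated and centred, and all function and variable names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z towards zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=200, tol=1e-8):
    """Minimize (1/2n)||y - Xb||^2 + lam*((1-alpha)/2*||b||_2^2 + alpha*||b||_1).

    Assumes no column of X is identically zero when alpha == 1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()          # residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n       # (1/n) * x_j^T x_j
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # (1/n) * x_j^T (partial residual that excludes predictor j)
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

# Simulated, centred toy data: only the first two of 20 predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
y -= y.mean()

# Smallest lambda at which the pure LASSO solution is entirely zero.
lam_max = np.max(np.abs(X.T @ y)) / len(y)
# Evaluate just above lam_max to sidestep floating-point ties.
beta_null = elastic_net_cd(X, y, lam=1.001 * lam_max, alpha=1.0)
beta_fit = elastic_net_cd(X, y, lam=0.05 * lam_max, alpha=0.5)
```

The soft-threshold term is what sets small coefficients to exactly zero, while the ridge term in the denominator shrinks the remaining coefficients without removing them.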

### Stability selected linear regression

Elastic net regression with regularization parameters tuned by cross-validation will typically include predictors of weak predictive power as well as some false positives (36). To target *S*^{D}, we therefore use a more stringent selection scheme. As mentioned above, we suppose that *D* ≪ *n*. Therefore, we are now only aiming to select predictors to use in a linear model; we do not want to regularize the parameters. Define the *support S* of a linear model by

$$S = \{ j : \beta_{j} \neq 0 \}.$$

First we introduce a weak support estimator. This estimator uses the cost function in Eq. (1) with *α* fixed at 0.8, while keeping *λ* large enough so that it never includes more than *q* variables. Given this constraint, the support is then estimated to be the included variables. To be more precise, introduce the family of support estimators

$$\hat{S}(\lambda) = \{ j : \hat{\beta}_{j}(\lambda) \neq 0 \},$$

where *β̂*(*λ*) denotes the solution of Eq. (1) for *α* = 0.8 and the given *λ*. We then use the support estimator *Ŝ*_{q} = *Ŝ*(*λ*^{*}), where *λ*^{*} is such that

$$\lambda^{*} = \min \{ \lambda : |\hat{S}(\lambda)| \leq q \}.$$

To find *S*^{D}, we wrap this weak support estimator in a sub-sampling scheme known as stability selection. The full scheme is outlined in Algorithm 2. Let *X*_{ss} be the columns of *X* corresponding to predictors selected by the stability selection scheme. The coefficient estimates for the final linear regression model of the immune cell proportion with measurements in *y* are then

$$\hat{\beta}_{ss} = \left( X_{ss}^{\top} X_{ss} \right)^{-1} X_{ss}^{\top} y.$$

We use the implementation of stability selection in the *stabs R* package (37).
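The overall logic of stability selection followed by a least squares refit can be sketched in Python. This is not the *stabs* implementation and does not reproduce Algorithm 2: the half-sample subsampling, the selection-frequency cutoff of 0.75, the *λ* grid, the value of *q* and all function names are assumptions chosen for illustration, and the weak support estimator is a small coordinate descent solver for the elastic net at *α* = 0.8.

```python
import numpy as np

def elastic_net_cd(X, y, lam, alpha, beta0=None, n_iter=100, tol=1e-6):
    """Coordinate descent for (1/2n)||y-Xb||^2 + lam*((1-alpha)/2*||b||_2^2 + alpha*||b||_1)."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    resid = y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)
            new /= col_sq[j] + lam * (1 - alpha)
            delta = new - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def weak_support(X, y, q, alpha=0.8, n_lambda=20):
    """Walk down a lambda grid; return the last support with at most q variables."""
    lam_max = np.max(np.abs(X.T @ y)) / (len(y) * alpha)
    beta = np.zeros(X.shape[1])
    support = np.zeros(X.shape[1], dtype=bool)
    for lam in lam_max * np.geomspace(1.0, 0.01, n_lambda):
        beta = elastic_net_cd(X, y, lam, alpha, beta0=beta)  # warm start
        if np.count_nonzero(beta) > q:
            break
        support = beta != 0
    return support

def stability_selection(X, y, q=5, n_subsamples=100, cutoff=0.75, seed=0):
    """Selection frequency of each predictor over random half-samples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        Xs = X[idx] - X[idx].mean(axis=0)   # centre within the subsample
        ys = y[idx] - y[idx].mean()
        counts += weak_support(Xs, ys, q)
    freq = counts / n_subsamples
    return np.flatnonzero(freq >= cutoff), freq

def ols_refit(X, y, selected):
    """Unpenalized least squares on the selected columns, with an intercept."""
    A = np.column_stack([np.ones(len(y)), X[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # intercept first, then one coefficient per selected column

# Simulated data: predictors 0 and 3 carry the signal, the other 18 are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 20))
y = 0.3 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.3, size=120)

selected, freq = stability_selection(X, y)
coef = ols_refit(X, y, selected)
```

Predictors that enter the model in nearly every subsample are retained, and the final coefficients are estimated without shrinkage, which mirrors the split between selection and unbiased fitting described above.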

### Other traits

The models above were developed primarily for immune cell proportions, but we use them also for the other traits. We suppose that most of the predictive power of whole blood DNA methylation for any trait comes from its intimate link with immune cell proportions. Therefore, we anticipate that prediction models of a form suitable for immune cell frequencies should work well also for traits related to them.

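For binary traits, we code the classes as either 0 or 1. The procedure we use for binary traits follows the algorithms above verbatim, except that the least squares term in the cost function in Eq. (1) is replaced by the negative log-likelihood of the binomial distribution given a logit link function,

$$-\frac{1}{n} \sum_{i=1}^{n} \left[ y_{i}\, x_{i}^{\top} \beta - \log\left( 1 + e^{x_{i}^{\top} \beta} \right) \right],$$

where *x*_{i} denotes the *i*-th row of *X*.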

Logistic regression with elastic net regularization is implemented in *glmnet*. For stability selection, we use the *stabs R* package with a custom built selection function based on *glmnet*.
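The loss used for binary traits can be written out in a few lines of Python. The sketch below is illustrative only: it omits the intercept, scales the log-likelihood by 1/*n* as in *glmnet*'s documented objective, and uses hypothetical function names and toy values.

```python
import numpy as np

def logistic_nll(X, y, beta):
    """Mean negative binomial log-likelihood under a logit link (classes coded 0/1)."""
    eta = X @ beta
    # log(1 + exp(eta)) evaluated stably as logaddexp(0, eta)
    return float(np.mean(np.logaddexp(0.0, eta) - y * eta))

def penalized_objective(X, y, beta, lam, alpha):
    """Logistic loss plus the elastic net penalty."""
    penalty = lam * (0.5 * (1 - alpha) * np.sum(beta ** 2)
                     + alpha * np.sum(np.abs(beta)))
    return logistic_nll(X, y, beta) + penalty

# Toy design matrix and 0/1 labels (illustrative values only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

nll_at_zero = logistic_nll(X, y, np.zeros(2))   # equals log(2) when beta = 0
beta = np.array([1.0, -1.0])
objective = penalized_objective(X, y, beta, lam=0.1, alpha=0.5)
```

With all coefficients at zero every predicted probability is 0.5, so the loss equals log 2 regardless of the labels, which provides a convenient sanity check.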

## Acknowledgements

This work benefited from support of the French government’s Program *Investissement d’Avenir*, managed by the Agence Nationale de la Recherche (ANR, reference 10-LABX-69-01). J.B. is a member of the LCCC Linnaeus Center and the ELLIIT Excellence Center at Lund University and is supported by the ELLIIT Excellence Center.