Abstract
Background CpG methylation levels can help to explain inter-individual differences in phenotypic traits. Few studies have explored whether identifying CpG subsets based on biological and statistical properties can maximise predictions while minimising array content.
Methods Variance component analyses and penalised regression (epigenetic predictors) were used to test the influence of (i) the number of CpGs considered, (ii) mean CpG methylation variability and (iii) methylation QTL status on the variance captured in eighteen traits by blood DNA methylation. Training and test sets comprised ≤4,450 and ≤2,578 unrelated individuals from Generation Scotland, respectively.
Results As the number of CpG sites under consideration decreased, so too did the estimates from the variance components and prediction analyses. Methylation QTL status and mean CpG variability did not influence variance components. However, relative effect sizes were 15% larger for epigenetic predictors based on CpGs with methylation QTLs compared to sites without methylation QTLs. Relative effect sizes were 45% larger for predictors based on CpGs with mean beta-values between 10% and 90% compared to those using hypo- or hypermethylated CpGs (beta-value ≤10% or ≥90%).
Conclusion Arrays with fewer CpGs could reduce costs, leading to increased sample sizes for analyses. However, our results show that reducing array content can restrict prediction metrics, so careful attention must be given to the biological and distributional properties of CpGs when selecting array content.
Competing Interest Statement
R.F.H. has received consultant fees from Illumina. R.E.M. has received speaker fees from Illumina and is an advisor to the Epigenetic Clock Development Foundation. The remaining authors declare that they have no competing interests.
Footnotes
Postal address: Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, Crewe Road South, University of Edinburgh, EH4 2XU, Edinburgh, UK.