Abstract
Deep phenotyping can enhance the power of genetic analyses such as genome-wide association studies (GWAS), but the frequent occurrence of missing phenotypes compromises the potential of such resources. Although many phenotypic imputation methods have been developed, accurate imputation for millions of individuals remains extremely challenging. In the present study, leveraging efficient machine learning (ML)-based algorithms, we developed a novel multi-phenotype imputation method based on mixed fast random forest (PIXANT), which is several orders of magnitude more efficient in runtime and computer memory usage than state-of-the-art methods when applied to the UK Biobank (UKB) data, and which scales to cohorts with millions of individuals. Our simulations with hundreds of individuals showed that PIXANT was superior or comparable in accuracy to the most advanced methods available. We also applied PIXANT to impute 425 phenotypes for the UKB data of 277,301 unrelated white British citizens and performed GWAS on the imputed phenotypes, identifying 15.6% more GWAS loci than before imputation (8,710 vs 7,355). Owing to the increased statistical power of GWAS, a number of novel genes were identified, such as RNF220, SCN10A and RGS6, which affect heart rate, demonstrating the use of imputed phenotype data in a large cohort to discover novel genes for complex traits.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Updated article; figures revised; author affiliations updated; supplemental tables updated.