Prioritization of disease genes from GWAS using ensemble-based positive-unlabeled learning

Nikita Kolosov; Mark J. Daly; Mykyta Artomov

doi:10.1101/2020.07.12.199273

Abstract

A major complication in understanding disease biology from GWAS arises from the inability to identify a complete set of causal genes. Integrating multiple omics data sources could provide an important functional link between associated variants and candidate genes. Machine learning could take advantage of this variety of data and provide a solution for prioritizing disease genes. Yet classical positive-negative classifiers face strong limitations in gene prioritization, such as the lack of reliable non-causal genes for training.

Here, we developed a novel gene prioritization tool, Gene Prioritizer (GPrior). It is an ensemble of five positive-unlabeled bagging classifiers that treat all genes of unknown relevance as an unlabeled set. GPrior selects an optimal combination of algorithms to tune the model for each specific phenotype.
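To illustrate the general idea of positive-unlabeled (PU) bagging, a minimal sketch follows. It is not GPrior's implementation: the function names, the toy two-feature data, and the nearest-centroid base scorer are all assumptions chosen so the example runs with the Python standard library alone. Each round draws a bootstrap sample of unlabeled items as provisional negatives, fits the base scorer against the known positives, and scores only the unlabeled items left out of that sample; the out-of-bag scores are then averaged. Only a single PU bagging classifier is shown, whereas GPrior combines five.

```python
import random
from statistics import fmean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=200, seed=0):
    """Average out-of-bag score per unlabeled item (higher = more positive-like).

    Hypothetical helper: the base learner is a nearest-centroid scorer,
    used here only as a stand-in for a real classifier.
    """
    rng = random.Random(seed)
    pos_center = centroid(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    bag_size = len(positives)  # balance provisional negatives with positives

    for _ in range(n_rounds):
        # Bootstrap (with replacement) a set of unlabeled items
        # and treat them as provisional negatives for this round.
        bag = [rng.randrange(len(unlabeled)) for _ in range(bag_size)]
        in_bag = set(bag)
        neg_center = centroid([unlabeled[i] for i in bag])

        # Score only the out-of-bag unlabeled items.
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue
            d_pos = sq_dist(x, pos_center)
            d_neg = sq_dist(x, neg_center)
            total = d_pos + d_neg
            # Score lies in [0, 1]; it approaches 1 near the positive centroid.
            totals[i] += 0.5 if total == 0 else d_neg / total
            counts[i] += 1

    return [t / c if c else float("nan") for t, c in zip(totals, counts)]


# Toy data (assumed for illustration): known positives cluster near (1, 1).
positives = [[1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
# The first two unlabeled items are "hidden positives"; the rest are background.
unlabeled = [
    [1.0, 1.0], [0.95, 1.05],
    [0.0, 0.1], [0.1, 0.0], [-0.1, 0.0],
    [0.0, -0.1], [0.1, 0.1], [-0.1, -0.1],
]

scores = pu_bagging_scores(positives, unlabeled)
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
print("unlabeled items ranked by score:", ranking)
```

In this toy setting the two hidden positives rank above the background items because provisional negatives drawn from the unlabeled set are dominated by background. Averaging over many bootstrap rounds is what lets the procedure tolerate true positives being sampled as negatives from time to time.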

Altogether, GPrior fills an important niche among methods for post-processing GWAS data, significantly improving the ability to pinpoint disease genes compared to existing solutions.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

The authors declare no conflict of interest.
https://github.com/faramer86/GPrior

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.