RT Journal Article
SR Electronic
T1 Supervised-learning is an accurate method for network-based gene classification
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 721423
DO 10.1101/721423
A1 Renming Liu
A1 Christopher A Mancuso
A1 Anna Yannakopoulos
A1 Kayla A Johnson
A1 Arjun Krishnan
YR 2019
UL http://biorxiv.org/content/early/2019/08/05/721423.abstract
AB Background: Assigning every human gene to specific functions, diseases, and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods such as supervised-learning and label-propagation that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine learning technique across fields, supervised-learning has been applied only in a few network-based studies for predicting pathway-, phenotype-, or disease-associated genes. It is unknown how supervised-learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label-propagation, the widely-benchmarked canonical approach for this problem. Results: In this study, we present a comprehensive benchmarking of supervised-learning for network-based gene classification, evaluating this approach and a state-of-the-art label-propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised-learning on a gene’s full network connectivity outperforms label-propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label-propagation’s appeal for naturally using network topology. We further show that supervised-learning on the full network is also superior to learning on node-embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. Conclusion: These results show that supervised-learning is an accurate approach for prioritizing genes associated with diverse functions, diseases, and traits and should be considered a staple of network-based gene classification workflows. The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. Contact: arjun{at}msu.edu