RT Journal Article
SR Electronic
T1 VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 702902
DO 10.1101/702902
A1 Arash Bayat
A1 Piotr Szul
A1 Aidan R. O’Brien
A1 Robert Dunne
A1 Oscar J. Luo
A1 Yatish Jain
A1 Brendan Hosking
A1 Denis C. Bauer
YR 2019
UL http://biorxiv.org/content/early/2019/07/15/702902.abstract
AB The demand for machine learning methods that cater for ultra high dimensional datasets, that is, datasets with millions of features, has been increasing in domains like life sciences and the Internet of Things (IoT). While Random Forests are suitable for “wide” datasets, current implementations such as Google’s PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forest. This paper introduces CursedForest, a novel Random Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest. CursedForest is 9 times faster than Google’s PLANET and up to 89 times faster than Yggdrasil, and is the first method capable of scaling to millions of features.