RT Journal Article
SR Electronic
T1 VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 702902
DO 10.1101/702902
A1 Arash Bayat
A1 Piotr Szul
A1 Aidan R. O’Brien
A1 Robert Dunne
A1 Oscar J. Luo
A1 Yatish Jain
A1 Brendan Hosking
A1 Denis C. Bauer
YR 2019
UL http://biorxiv.org/content/early/2019/07/15/702902.abstract
AB The demands on machine learning methods to cater for ultra high dimensional datasets, datasets with millions of features, have been increasing in domains like life sciences and the Internet of Things (IoT). While Random Forests are suitable for “wide” datasets, current implementations such as Google’s PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forest. This paper introduces CursedForest, a novel Random Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest. CursedForest is 9 and up to 89 times faster than Google’s PLANET and Yggdrasil, respectively, and is the first method capable of scaling to millions of features.