TY  - JOUR
T1  - SPADIS: An Algorithm for Selecting Predictive and Diverse SNPs in GWAS
JF  - bioRxiv
DO  - 10.1101/256677
SP  - 256677
AU  - Serhan Yilmaz
AU  - Oznur Tastan
AU  - A. Ercument Cicek
Y1  - 2018/01/01
UR  - http://biorxiv.org/content/early/2018/03/26/256677.abstract
N2  - Phenotypic heritability of complex traits and diseases is seldom explained by individual genetic variants. Algorithms that select SNPs which are close and connected on a biological network have been successful in finding biologically interpretable and predictive loci. However, we argue that the connectedness constraint favors selecting redundant features that affect similar biological processes and therefore does not necessarily yield better predictive performance. In this paper, we propose a novel method called SPADIS that selects SNPs that cover diverse regions in the underlying SNP-SNP network. SPADIS favors the selection of remotely located SNPs in order to account for the complementary additive effects of SNPs that are associated with the phenotype. This is achieved by maximizing a submodular set function with a greedy algorithm that ensures a constant-factor (1−1/e) approximation. We compare SPADIS to the state-of-the-art method SConES on a dataset of Arabidopsis thaliana genotypes and continuous flowering time phenotypes. SPADIS has better average regression performance in 12 out of 17 phenotypes, identifies more candidate genes, and runs faster. We also investigate, for the first time, the use of Hi-C data to construct the SNP-SNP network in the context of the SNP selection problem, which yields slight but consistent improvements in regression performance. SPADIS is available at http://ciceklab.cs.bilkent.edu.tr/spadis
ER  - 