RT Journal Article
SR Electronic
T1 Shared Nearest Neighbor clustering in a Locality Sensitive Hashing framework
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 093898
DO 10.1101/093898
A1 Sawsan Kanj
A1 Thomas Brüls
A1 Stéphane Gazut
YR 2016
UL http://biorxiv.org/content/early/2016/12/15/093898.abstract
AB We present a new algorithm to cluster high-dimensional sequence data, and its application to the field of metagenomics, which aims to reconstruct individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches seeking to estimate the density of clusters, e.g., using the shared nearest neighbors rule, due to the prohibitive size of contemporary sequence datasets. We explore here a new method based on combining the shared nearest neighbor (SNN) rule with the concept of Locality Sensitive Hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller-sized subsets (buckets) and employing the shared nearest neighbor rule on each of these buckets. Links can be created among neighbors sharing a sufficient number of elements, hence allowing clusters to be grown from linked elements. LSH-SNN can scale up to larger datasets consisting of millions of sequences, while achieving high accuracy across a variety of sample sizes and complexities.
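The bucket-then-link idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the hash family, the distance, or the linking threshold, so everything concrete here is an assumption. The sketch assumes numeric feature vectors, a random-hyperplane (sign of projection) hash, Euclidean distance, a link rule requiring two points to be in each other's k-nearest-neighbor lists and to share at least `min_shared` neighbors, and a union-find structure to grow clusters. All function and parameter names (`random_hyperplanes`, `bucket_key`, `knn_sets`, `lsh_snn`, `k`, `min_shared`) are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def random_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def bucket_key(point, hyperplanes):
    """LSH signature: the side of each hyperplane on which the point falls."""
    return tuple(sum(p * h for p, h in zip(point, plane)) >= 0.0
                 for plane in hyperplanes)


def knn_sets(points, members, k):
    """k nearest neighbors of each bucket member, searched within the bucket only."""
    result = {}
    for i in members:
        others = sorted((j for j in members if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        result[i] = set(others[:k])
    return result


def lsh_snn(points, hyperplanes, k=3, min_shared=2):
    """Return one cluster label per point (illustrative sketch only)."""
    # 1. Split the data into buckets by LSH signature.
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        buckets[bucket_key(point, hyperplanes)].append(idx)

    # Union-find over point indices, used to grow clusters from links.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # 2. Apply the shared nearest neighbor rule inside each bucket.
    for members in buckets.values():
        neighbors = knn_sets(points, members, k)
        for i, j in combinations(members, 2):
            mutual = j in neighbors[i] and i in neighbors[j]
            shared = len(neighbors[i] & neighbors[j])
            if mutual and shared >= min_shared:
                parent[find(i)] = find(j)  # link the two elements

    # 3. Relabel union-find roots as consecutive cluster ids.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]


# Toy data: two tight groups on either side of the hyperplane x = 0.
points = [(1.0, 0.0), (1.1, 0.0), (1.0, 0.1), (1.1, 0.1),
          (-1.0, 0.0), (-1.1, 0.0), (-1.0, 0.1), (-1.1, 0.1)]
labels = lsh_snn(points, hyperplanes=[[1.0, 0.0]], k=3, min_shared=2)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this sketch the neighbor search is quadratic only in the bucket size, which is where the saving over a global nearest-neighbor search would come from. One consequence of using a single hash table is that elements falling into different buckets are never linked; how the published method handles that case is not stated in the abstract.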