TY  - JOUR
T1  - Large-scale tandem mass spectrum clustering using fast nearest neighbor searching
JF  - bioRxiv
DO  - 10.1101/2021.02.05.429957
SP  - 2021.02.05.429957
AU  - Wout Bittremieux
AU  - Kris Laukens
AU  - William Stafford Noble
AU  - Pieter C. Dorrestein
Y1  - 2021/01/01
UR  - http://biorxiv.org/content/early/2021/02/08/2021.02.05.429957.abstract
N2  - Rationale: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data being generated. Here we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra. Methods: falcon efficiently clusters large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively compare all spectra to each other. Finally, density-based clustering is performed to group similar spectra into clusters. Results: Using a large draft human proteome dataset consisting of 25 million spectra, falcon is able to generate clusters of similar quality to those of MS-Cluster and spectra-cluster, two widely used clustering tools, while being considerably faster. Notably, at comparable cluster quality levels, falcon generates larger clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information, for more efficient downstream processing. Conclusions: falcon is a highly efficient spectrum clustering tool. It is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon. Competing Interest Statement: The authors have declared no competing interest.
ER  - 