Linclust: clustering billions of protein sequences per day on a single server

Martin Steinegger; Johannes Söding

doi:10.1101/104034

Abstract

Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. But clustering them with current algorithms is impractical because runtimes depend almost quadratically on input set size. Linclust's linear scaling overcomes this limitation, enabling us to cluster and assemble 1.6 billion sequence fragments from 2200 metagenomic datasets in (10 + 30) hours on 28 cores into 711 million sequences. (Open-source software and Metaclust database: https://mmseqs.org/).

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.