Abstract
Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. But clustering them with current algorithms is impractical because runtimes depend almost quadratically on input set size. Linclust's linear scaling overcomes this limitation, enabling us to cluster and assemble 1.6 billion sequence fragments from 2200 metagenomic datasets in (10 + 30) hours on 28 cores into 711 million sequences. (Open-source software and Metaclust database: https://mmseqs.org/).
Copyright
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.