Abstract
Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale prediction of protein functions and structures. Linclust can cluster sequences down to 50% pairwise sequence similarity, and its runtime scales linearly with the input set size rather than nearly quadratically, as in conventional algorithms. We cluster 1.7 billion sequences from ~2200 metagenomic and metatranscriptomic datasets in 30 hours on 28 cores. Free software and data are available at https://mmseqs.com/.
Copyright
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.