TY  - JOUR
T1  - Linclust: clustering billions of protein sequences per day on a single server
JF  - bioRxiv
DO  - 10.1101/104034
SP  - 104034
AU  - Martin Steinegger
AU  - Johannes Söding
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/05/25/104034.abstract
N2  - Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. But clustering them with current algorithms is impractical because runtimes depend almost quadratically on input set size. Linclust's linear scaling overcomes this limitation, enabling us to cluster and assemble 1.6 billion sequence fragments from 2200 metagenomic datasets in (10 + 30) hours on 28 cores into 711 million sequences. (Open-source software and Metaclust database: https://mmseqs.org/).
ER  - 