RT Journal Article
SR Electronic
T1 Linclust: clustering protein sequences in linear time
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 104034
DO 10.1101/104034
A1 Martin Steinegger
A1 Johannes Söding
YR 2017
UL http://biorxiv.org/content/early/2017/01/29/104034.abstract
AB Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale prediction of protein functions and structures. Linclust can cluster sequences down to 50% pairwise sequence similarity and its runtime scales linearly with the input set size, not nearly quadratically as in conventional algorithms. We cluster 1.7 billion sequences from ~2200 metagenomic and metatranscriptomic datasets in 30 hours on 28 cores. Free software and data available at https://mmseqs.com/.