TY - JOUR T1 - Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank JF - bioRxiv DO - 10.1101/2020.01.26.920173 SP - 2020.01.26.920173 AU - Martin Steinegger AU - Steven L Salzberg Y1 - 2020/01/01 UR - http://biorxiv.org/content/early/2020/01/26/2020.01.26.920173.abstract N2 - Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator ER -