Abstract
With over 50k citations since its introduction in 1990, the NCBI BLAST family has been an essential tool of in silico molecular biology. The BLAST nt database, based on the traditional divisions of GenBank, has been the default and most comprehensive database for nucleotide BLAST searches and for taxonomic classification software in metagenomics. Here we argue that this is no longer the case. Currently, the NCBI WGS database contains one billion reads (almost five times more than GenBank) and, with 4.4 trillion nucleotides, has about 14 times more nucleotides than GenBank. This ratio is growing with time. We advocate a change in the database paradigm in taxonomic classification: systematically combining the nt and WGS databases in order to boost the sensitivity of taxonomic classifiers. We present here a case in which, by adding WGS data, we obtained over five times more classified reads, and with a higher confidence score. To facilitate the adoption of this approach, we provide the draftGenomes script.
Author summary
Culture-independent methods are revolutionizing biology. The NIH/NCBI Basic Local Alignment Search Tool (BLAST) is one of the most widely used methods in computational biology. The BLAST nt database has become a de facto standard for taxonomic classifiers in metagenomics. We believe that it is time for a change in the database paradigm for such classification. We advocate the systematic combination of the BLAST nt database with genomes of the massive NCBI Whole-Genome Shotgun (WGS) database. We make available draftGenomes, a script that eases the adoption of this approach. Current developments and technologies now make this feasible. Our recent results in several metagenomic projects indicate that this strategy boosts the sensitivity of taxonomic classification.
Footnotes
Abbreviations
- BLAST: Basic Local Alignment Search Tool
- DDBJ: DNA Data Bank of Japan
- EMBL: European Molecular Biology Laboratory
- EST: Expressed Sequence Tags NCBI database
- GSS: Genome Survey Sequences NCBI database
- HTG: High-throughput unfinished genome sequences NCBI database
- NCBI: National Center for Biotechnology Information
- NIH: National Institutes of Health
- nt: Partially non-redundant nucleotide sequences NCBI database
- PAT: Patented sequences NCBI database
- SCM: Storage Class Memory
- STS: Sequence Tagged Sites NCBI database
- WGS: Whole-Genome Shotgun NCBI database