ABSTRACT
Taxonomy assignment for microbial community composition studies can be limited by a lack of relevant reference organisms in large taxonomy databases. TaxAss is a taxonomy assignment workflow that classifies 16S rRNA gene amplicon data using two taxonomy reference databases: a large comprehensive database such as Greengenes or SILVA, and a small ecosystem-specific database curated by scientists within a field. To test TaxAss performance, we classified five different freshwater datasets using the comprehensive Greengenes database and the freshwater-specific FreshTrain database. TaxAss increased the percentage of the dataset classified compared to using only Greengenes. The increase was largest at fine-resolution taxa levels: across the freshwater test datasets, species-level classifications increased by 24-40 percent of reads. A similar increase in classifications was not observed in a control mouse gut dataset, which was not expected to contain freshwater bacteria. TaxAss also maintained taxonomic richness compared to using only the FreshTrain, and it did so across all taxa levels from phylum to species. Without TaxAss, the majority of organisms not represented in the FreshTrain were unclassified, but at finer taxa levels incorrect classifications were also significant. TaxAss splits a dataset's sequences into two groups based on their percent identity to reference sequences in the ecosystem-specific database. Highly similar sequences are classified using the ecosystem-specific database, and the others are classified using the comprehensive database. TaxAss metrics help users choose a percent identity cutoff appropriate for their data. TaxAss is free and open source, and available at www.github.com/McMahonLab/TaxAss.
IMPORTANCE Microbial communities drive ecosystem processes, but microbial community composition analyses using 16S rRNA gene amplicon datasets are limited by the lack of fine-resolution taxonomy classifications. Coarse taxonomic groupings at the phylum, class, and order levels lump ecologically distinct organisms together. To avoid this, many researchers cluster similar sequences into operational taxonomic units (OTUs). These fine-resolution groupings are more ecologically relevant, but OTU definitions are dataset-dependent and therefore cannot be compared between datasets. Microbial ecologists studying a variety of environments have curated small, ecosystem-specific taxonomy databases to provide consistent and up-to-date terminology. We created TaxAss, a workflow that leverages these ecosystem-specific databases to assign taxonomy. We found that TaxAss improves fine-resolution taxonomic classifications (family, genus, and species). Fine taxonomic groupings are also ecologically relevant, so they provide an alternative to OTU-based analyses that is consistent and comparable between datasets. TaxAss enables researchers to compare data using ecologically relevant terminology.
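The sequence-splitting step described in the abstract can be illustrated with a minimal Python sketch. This is not the TaxAss implementation. The function name, the input layout (a mapping from sequence ID to the best percent identity found against the ecosystem-specific database), the handling of sequences with no hit, and the use of "greater than or equal to" at the cutoff are all assumptions made for illustration. The cutoff value in the usage lines is an arbitrary example, not a recommended setting.

```python
from typing import Dict, List, Optional, Tuple


def split_by_percent_identity(
    best_identity: Dict[str, Optional[float]],
    cutoff: float,
) -> Tuple[List[str], List[str]]:
    """Split sequence IDs into two groups by percent identity.

    best_identity maps each sequence ID to its best percent identity
    against the ecosystem-specific database, or None if no match was
    found (hypothetical input layout, assumed for this sketch).
    """
    ecosystem_ids: List[str] = []      # to classify with the ecosystem-specific database
    comprehensive_ids: List[str] = []  # to classify with the comprehensive database

    for seq_id, pident in best_identity.items():
        # Assumption: a sequence exactly at the cutoff counts as highly similar,
        # and a sequence with no hit falls back to the comprehensive database.
        if pident is not None and pident >= cutoff:
            ecosystem_ids.append(seq_id)
        else:
            comprehensive_ids.append(seq_id)

    return ecosystem_ids, comprehensive_ids


# Usage with made-up identities and an arbitrary example cutoff.
example_hits = {"seq1": 99.5, "seq2": 97.0, "seq3": None, "seq4": 98.0}
eco, comp = split_by_percent_identity(example_hits, cutoff=98.0)
print(eco)   # ['seq1', 'seq4']
print(comp)  # ['seq2', 'seq3']
```

Each group would then be passed to a classifier with the corresponding reference database. Raising the cutoff sends fewer sequences to the ecosystem-specific database, which is the trade-off the percent identity cutoff controls.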