Summary
MMseqs2 taxonomy is a new tool for assigning taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them robust labels and determines the contig’s taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18x faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as for reporting and visualizing taxonomic assignments.
Availability: MMseqs2 taxonomy is part of the MMseqs2 free open-source software package available for Linux, macOS and Windows at https://mmseqs.com.
Contact: eli.levy.karin{at}gmail.com
I. INTRODUCTION
Metagenomic studies shine a light on previously unstudied parts of the tree of life. However, unraveling taxonomic composition accurately and quickly remains a challenge. While most methods label short metagenomic reads (reviewed in [11]), only a handful (e.g. [6]) assign entire contigs, even though this should lead to improved accuracy.
Recently, [12] developed CAT, a tool for taxonomic annotation of contigs based on protein homologies to a reference database. It combines Prodigal [7] for predicting open reading frames (ORFs), DIAMOND [3] for searching with the translated ORFs, and logic for aggregating individual ORF annotations. CAT achieved higher precision than state-of-the-art tools on bacterial benchmarks. Despite this advantage, CAT has limitations: (1) Prodigal was designed for prokaryotes, not eukaryotes [13]; (2) Prodigal runs single-threaded, limiting its applicability to metagenomics; (3) CAT’s r parameter sets how far below each ORF’s top-hit score a hit may fall and still be included in the ORF’s lowest common ancestor (LCA) computation. Although the authors provide guidelines for setting r, it is unclear how general they are.
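To illustrate why the choice of r matters, the following minimal sketch applies a top-hit range filter followed by an LCA computation. It assumes that r is a percentage of the top-hit score; the toy lineages, hit scores and function names are hypothetical and are not taken from CAT's code.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lineages: taxon -> path from the root down to that taxon.
LINEAGES: Dict[str, List[str]] = {
    "E. coli": ["root", "Bacteria", "Proteobacteria", "Escherichia", "E. coli"],
    "E. albertii": ["root", "Bacteria", "Proteobacteria", "Escherichia", "E. albertii"],
    "B. subtilis": ["root", "Bacteria", "Firmicutes", "Bacillus", "B. subtilis"],
}


def hits_in_range(hits: List[Tuple[str, float]], r: float) -> List[str]:
    """Keep the taxa of hits scoring within r percent of the top hit (assumed meaning of r)."""
    top = max(score for _, score in hits)
    cutoff = top * (1.0 - r / 100.0)
    return [taxon for taxon, score in hits if score >= cutoff]


def lca(taxa: List[str]) -> str:
    """Return the deepest node shared by the root-to-taxon paths of all taxa."""
    paths = [LINEAGES[t] for t in taxa]
    common = "root"
    for nodes in zip(*paths):
        if len(set(nodes)) != 1:
            break
        common = nodes[0]
    return common


# Made-up hits of a single ORF: (taxon of the hit, alignment score).
hits = [("E. coli", 200.0), ("E. albertii", 185.0), ("B. subtilis", 120.0)]
narrow = lca(hits_in_range(hits, r=5))   # only the top hit survives
medium = lca(hits_in_range(hits, r=10))  # the two Escherichia hits survive
wide = lca(hits_in_range(hits, r=50))    # all three hits survive
print(narrow, medium, wide, sep=" | ")
```

In this toy case the same ORF is labeled at species, genus or domain level depending only on r, which is the sensitivity that an automatic selection of hits is meant to avoid.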
Here we present MMseqs2 taxonomy, a novel protein-search-based tool for taxonomy assignment to contigs. It overcomes the aforementioned limitations by extracting all possible protein fragments, covering the coding repertoire of all domains of life. It quickly eliminates fragments that do not bear minimal similarity to the reference database, and searches with the remaining ones. MMseqs2 taxonomy uses an approximate 2bLCA (a2bLCA) [5] strategy to assign translated fragments to taxonomic nodes (Supp. Inf.). The hits for the a2bLCA computation are determined automatically, removing the need to tune an equivalent of CAT’s r parameter. MMseqs2 taxonomy outperforms CAT on bacterial and eukaryotic data sets.
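The idea of extracting all possible protein fragments can be sketched as a six-frame, stop-to-stop translation. The code below is an illustration under stated assumptions, not the MMseqs2 implementation: it uses the standard genetic code only, and the `min_len` threshold and function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_fragments(contig: str, min_len: int = 30) -> List[Tuple[str, int, str]]:
    """Return (strand, frame, fragment) for every stop-to-stop stretch of at least min_len residues."""
    contig = contig.upper()
    fragments = []
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for frame in range(3):
            for piece in translate(seq[frame:]).split("*"):
                if len(piece) >= min_len:
                    fragments.append((strand, frame, piece))
    return fragments


# Tiny made-up contig; min_len is lowered so that the short fragments are kept.
fragments = six_frame_fragments("ATGGCCAAATAGGGG", min_len=3)
for strand, frame, peptide in fragments:
    print(strand, frame, peptide)
```

Because no gene model is involved, such an extraction does not depend on prokaryote-specific signals, at the cost of producing many fragments that must be filtered afterwards.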
II. METHODS
Input
Contigs are provided as (compressed) FASTA/Q files. As a reference, the databases workflow can download and prepare various public taxonomy databases, such as nr [1], UniProt [2] or GTDB [10]. Alternatively, users can prepare their own taxonomic reference database (see MMseqs2 wiki).
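A typical run might be scripted as in the sketch below, which only assembles the command lines. The module names and argument order reflect our reading of the MMseqs2 user guide and should be checked against `mmseqs <module> -h`; the file names, the working directory layout and the helper function are hypothetical.

```python
import shlex
from typing import List


def taxonomy_commands(contigs: str, ref_name: str, workdir: str = "work") -> List[List[str]]:
    """Build (but do not run) the commands for a basic taxonomy assignment."""
    ref_db = f"{workdir}/refDB"
    query_db = f"{workdir}/contigsDB"
    result = f"{workdir}/taxResult"
    tmp = f"{workdir}/tmp"
    return [
        # Download and prepare a public reference database with taxonomy.
        ["mmseqs", "databases", ref_name, ref_db, tmp],
        # Convert the FASTA/Q contigs into an MMseqs2 database.
        ["mmseqs", "createdb", contigs, query_db],
        # Assign a taxonomic label to each contig.
        ["mmseqs", "taxonomy", query_db, ref_db, result, tmp],
        # Export the assignments as a TSV file.
        ["mmseqs", "createtsv", query_db, result, f"{workdir}/taxonomy.tsv"],
        # Summarize the assignments in a Kraken-style report.
        ["mmseqs", "taxonomyreport", ref_db, result, f"{workdir}/report.txt"],
    ]


commands = taxonomy_commands("contigs.fasta", "UniProtKB/Swiss-Prot")
for cmd in commands:
    print(shlex.join(cmd))
    # To execute: subprocess.run(cmd, check=True), after creating workdir.
```

Keeping the commands as lists avoids shell quoting problems when file names contain spaces.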
Output
MMseqs2 taxonomy returns the following eight fields for each contig accession: (1) the taxonomic identifier (taxid) of the assigned label, (2) its rank and (3) its name, followed by the number of fragments that were (4) retained, (5) taxonomically assigned, and (6) in agreement with the contig label (i.e., assigned the same taxid or one of its descendants), (7) the support the taxid received and, optionally, (8) the full lineage. The result can be converted to a TSV file, and to a Kraken [14] report or a Krona [9] visualization (Supp. Inf.).
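Downstream scripts can read the TSV output along the lines of the sketch below. It assumes one tab-separated row per contig, with the accession followed by the fields in the order listed above; the class name, the derived `agreement_fraction` and the two example rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, TextIO


@dataclass
class ContigAssignment:
    accession: str
    taxid: int
    rank: str
    name: str
    retained: int   # fragments retained after prefiltering
    assigned: int   # fragments that received a taxonomic label
    agreeing: int   # fragments in agreement with the contig label
    support: float  # support received by the contig label
    lineage: Optional[str] = None  # present only if requested

    @property
    def agreement_fraction(self) -> float:
        """Share of assigned fragments that agree with the contig label."""
        return self.agreeing / self.assigned if self.assigned else 0.0


def parse_taxonomy_tsv(handle: TextIO) -> List[ContigAssignment]:
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        records.append(ContigAssignment(
            row[0], int(row[1]), row[2], row[3],
            int(row[4]), int(row[5]), int(row[6]), float(row[7]),
            row[8] if len(row) > 8 else None,
        ))
    return records


# Made-up rows, not real program output.
example = (
    "contig_1\t562\tspecies\tEscherichia coli\t12\t10\t8\t0.85\n"
    "contig_2\t0\tno rank\tunclassified\t3\t0\t0\t0.0\n"
)
records = parse_taxonomy_tsv(io.StringIO(example))
for rec in records:
    print(rec.accession, rec.name, rec.agreement_fraction)
```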
III. RESULTS
Bacterial dataset
The CAMI-I high-complexity challenge and its accompanying RefSeq 2015 reference database [11] were given as input to MMseqs2 and CAT. AMBER v2 [8] was used in its taxonomic binning benchmark mode to assess the taxonomic assignment by computing the average completeness (Fig 1B) and the average purity (Fig S1), both measured in base pairs (bp). At similar assignment quality, MMseqs2 taxonomy is 18x faster than CAT. Using the nr database, MMseqs2 is 10x faster (Fig S2).
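The two quality measures can be sketched for a single taxonomic rank as follows. This is a simplified reading of bp-weighted purity and completeness, not AMBER's implementation; the input format and the example contigs are hypothetical.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def purity_completeness_bp(
    contigs: List[Tuple[int, str, Optional[str]]],
) -> Tuple[float, float]:
    """contigs: (length in bp, true taxon, predicted taxon or None if unassigned)."""
    true_bp = defaultdict(int)     # bp that truly belong to each taxon
    pred_bp = defaultdict(int)     # bp predicted to belong to each taxon
    correct_bp = defaultdict(int)  # bp predicted correctly, per taxon
    for length, truth, pred in contigs:
        true_bp[truth] += length
        if pred is not None:
            pred_bp[pred] += length
            if pred == truth:
                correct_bp[truth] += length
    # Purity: share of the bp predicted as a taxon that truly belong to it.
    purity = [correct_bp[t] / bp for t, bp in pred_bp.items()]
    # Completeness: share of a taxon's true bp that were predicted as that taxon.
    completeness = [correct_bp[t] / bp for t, bp in true_bp.items()]
    avg_purity = sum(purity) / len(purity) if purity else 0.0
    avg_completeness = sum(completeness) / len(completeness) if completeness else 0.0
    return avg_purity, avg_completeness


# Made-up example with two taxa and one unassigned contig.
toy = [(1000, "A", "A"), (500, "A", "B"), (500, "B", "B"), (1000, "B", None)]
avg_purity, avg_completeness = purity_completeness_bp(toy)
print(avg_purity, avg_completeness)
```

Weighting by bp rather than by contig count means that a long misassigned contig lowers both measures more than a short one does.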
Eukaryotic dataset
All 57 SAR (taxid 2698737) RefSeq assemblies and their taxonomic labels were downloaded from NCBI in 08/2020. To resemble metagenomic data, their scaffolds were randomly divided following the length distribution of contigs assembled for sample ERR873969 of eukaryotic Tara Oceans [4], resulting in 2.7 million non-overlapping contigs with a minimal length of 300 bp. Using nr from 08/2020, MMseqs2 classified more contigs than CAT (62% vs. 47%). For 36% of the contigs, CAT extracted a fragment that did not hit the reference, suggesting that the fragments extracted by MMseqs2 are more informative for eukaryotic taxonomic annotation (Fig 1C, S3).
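One way to divide scaffolds according to an empirical length distribution is sketched below. The exact procedure used for this benchmark is not described here, so the sampling scheme, the handling of the scaffold tail and all names are assumptions made for illustration.

```python
import random
from typing import List, Optional


def split_scaffold(
    scaffold: str,
    length_pool: List[int],
    min_len: int = 300,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Cut a scaffold into consecutive, non-overlapping pieces with lengths drawn from length_pool."""
    rng = rng or random.Random(0)
    usable = [n for n in length_pool if n >= min_len]
    if not usable:
        raise ValueError("length_pool contains no length >= min_len")
    contigs = []
    pos = 0
    # Stop once the remaining tail is shorter than min_len; the tail is discarded.
    while len(scaffold) - pos >= min_len:
        length = min(rng.choice(usable), len(scaffold) - pos)
        contigs.append(scaffold[pos:pos + length])
        pos += length
    return contigs


# Hypothetical empirical contig lengths, e.g. taken from an assembled sample.
observed_lengths = [300, 450, 800, 1200, 150]
scaffold = "ACGT" * 1000  # 4,000 bp made-up scaffold
pieces = split_scaffold(scaffold, observed_lengths, rng=random.Random(42))
print([len(p) for p in pieces])
```

Drawing lengths from the observed pool reproduces its distribution without fitting a parametric model; passing a seeded generator keeps the split reproducible.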
IV. CONCLUSION
MMseqs2 taxonomy is as accurate as CAT on a bacterial data set while being 3-18x faster and requiring fewer parameters. Its extracted fragments make it suitable for analyzing eukaryotes. It is accompanied by several taxonomy utility modules to assist with taxonomic analyses.
FUNDING
ELK is a FEBS long-term fellowship recipient. The work was supported by the BMBF CompLifeSci project horizontal4meta; the ERC’s Horizon 2020 Framework Programme [‘Virus-X’, project no. 685778]; the National Research Foundation of Korea grant funded by the Korean government (MEST) [2019R1A6A1A10073437, NRF-2020M3A9G7103933]; and the Creative-Pioneering Researchers Program through Seoul National University.
Conflict of Interest
none declared