Reference genome-independent taxonomic profiling of microbiomes with mOTUs3

Background Taxonomic profiling is a fundamental task in microbiome research that aims to detect and quantify the relative abundance of microorganisms in biological samples. Available methods using shotgun metagenomic data generally depend on the availability of sequenced and taxonomically annotated reference genomes. However, the majority of microorganisms have not been cultured yet and lack such reference genomes. Thus, a substantial fraction of microbial community members remains unaccounted for during taxonomic profiling of metagenomes, particularly in samples from underexplored environments. To address this issue, we have developed the mOTU profiler, a tool that enables reference genome-independent species-level profiling of metagenomes. As such, it supports the identification and quantification of both “known” and “unknown” species based on a set of select marker genes. Results Here, we present mOTUs3, a command line tool that enables the profiling of metagenomes for >33,000 species-level operational taxonomic units. To achieve this, we leveraged the reconstruction and analysis of >600,000 draft genomes, most of which are metagenome assembled genomes (MAGs), from diverse microbiomes, including soil, freshwater systems, and the gastrointestinal tract of ruminants and other animals, which we found to be greatly underrepresented by reference genomes. Overall, two-thirds of all species-level taxa lacked a reference genome. The cumulative relative abundance of these newly included taxa was low in well-studied microbiomes, such as the human body sites (6-11%). By contrast, they accounted for substantial proportions (ocean, freshwater, soil: 43-63%) or even the vast majority (pig, fish, cattle: 60-80%) of the relative abundance across diverse non-human-associated microbiomes. Using community-developed benchmarks and datasets, we found mOTUs3 to be more accurate than other methods and to be more congruent with 16S rRNA gene-based methods for taxonomic profiling. Furthermore, we demonstrate that mOTUs3 greatly increases the resolution of well-known microbial groups into species-level taxa and helps identify new differentially abundant taxa in comparative metagenomic studies. Conclusions We developed mOTUs3 to enable accurate species-level profiling of metagenomes. Compared to other methods, it provides a more comprehensive view of prokaryotic community diversity, in particular for currently underexplored microbiomes. To facilitate comparative analyses by the research community, it is released with >11,000 precomputed profiles for publicly available metagenomes and is freely available at: https://github.com/motu-tool/mOTUs.

6 unassigned taxa was highest for soil samples (33%; s.d. 8%), which reflects the high microbial 146 diversity in soil as well as challenges in reconstructing genomes from this environment [26]. By 147 contrast, more than 87% (s.d. 0.7%) of the relative abundance was represented by ref-mOTUs in 148 human skin samples mainly due to the dominance of few taxa with cultivated representatives [27]. 149 Similarly, the fraction of relative abundance assigned to ext-mOTUs varied considerably between 150 environments: on average, only ~6% of the bacterial abundance in human-associated samples was 151 assigned to newly added taxa, while this fraction was as high as ~80% in cattle rumen microbiomes. 152

153
Comparison with other taxonomic profilers 154 As in other fields of bioinformatics, there is broad consensus that the performance of analysis tools 155 needs to be carefully evaluated. However, best practices (e.g., balancing precision and recall, 156 selecting criteria for 'best' performance) are often debated [28,29], and in microbiome research, an 157 agreement on some fundamental concepts (e.g., sequence vs. taxonomic abundance, representation of 158 unknown taxa in ground truth data) is still lacking [30,31]. In an attempt to address some of these 159 issues in a community-driven effort, modeled after successful examples in other fields [32,33], the 160 Critical Assessment of Metagenome Interpretation (CAMI) has provided curated ground truth datasets 161 along with a tool (OPAL) to reproducibly evaluate metagenomic analysis tools [3,22]. 162 Using the latest CAMI datasets with disclosed results [34], we compared mOTUs3 to its prior major 163 release version (mOTUs2) [14] and other selected metagenomic profiling tools (MetaPhlAn3 [5] and 164 Bracken [4,35], Methods) representing conceptually different, well-performing approaches to 165 taxonomic profiling [30]. Using the OPAL tool for scoring and evaluation, we first evaluated 166 presence/absence (F1-score) and relative abundance predictions (L1 norm error) at the species level. 167 For the different datasets, which represented samples from five human body sites and the mouse gut 168 microbiome, mOTUs3 and MetaPhlAn3 performed generally better than Bracken and mOTUs2 169 (Figure 2a/b). At higher taxonomic ranks, mOTUs3 had similar or higher scores than the other tools. 170 For some datasets, taxonomic ranks and tools, there was little to no room for improvements of the F1-171 score or L1 norm error. This may be due to the simulated datasets being mainly based on taxa for different profilers compared to the ground truth. In addition to the L1 norm error, OPAL computes 174 additional metrics for profiling quality (completeness, purity, weighted UniFrac error) and 175 summarizes them across taxonomic ranks into a composite score. Based on this evaluation criterion, 176 mOTUs3 outperformed the other tools (Figure 2c), as well as additional tools assessed in the CAMI 177 challenge (Methods; Supplementary Figure 2). 178 In the absence of independent ground truth data sets to benchmark taxonomic profiling tools for less 179 well-studied environments, we correlated taxonomic profiles obtained by mOTUs3 and other tools to 180 those obtained by analyzing 16S rRNA gene (16S) fragments. This approach leverages both the 181 availability of comprehensive 16S databases for taxonomic classification [36] and the possibility of 182 estimating taxonomic abundances based on 16S-based data from metagenomes [37]. Briefly, we 183 extracted 16S fragments from the same datasets we used for metagenomic profiling and generated 184 relative abundance profiles for them (Methods). To ensure comparability between 16S and 185 metagenomic profiles, the analysis was performed at the genus and higher taxonomic ranks (for 186 discussion, see Salazar et al. [37]). We found that mOTUs3 had consistently higher correlations with 187 16S profiles than the other tools across all environments, except for the human gut for which 188 MetaPhlAn3 showed correlation coefficients similar to those of mOTUs3 ( Figure 3) Figure 4a), which is common practice in the field of microbial phylogenomics [7,40]. 204 Moreover, we found sequence identities of mOTUs-representing MGs to linearly correlate with those 205 of whole genomes across the whole range of observed values (r 2 =0.71; Figure 4b). By contrast, 16S 206 sequence-based OTUs using a 97% or 99% sequence similarity cutoff resulted in a 31.7-fold (n=19) 207 or 5.8-fold (n=104) lower number of taxonomic units, respectively, compared to mOTUs (Figure 4a). 208 This discrepancy is also reflected by a weaker correlation (r 2 =0.45; Figure 4b High-resolution taxonomic profiling of metagenomes from underexplored environments can be 217 achieved by custom-made marker gene or genome databases selected for the microbial community 218 under study [12,41]. However, this approach is often labor-and resource-intensive and requires 219 specialized expertise, and its results cannot easily be compared across studies and communities. To 220 demonstrate the utility of mOTUs3 to address these challenges, we reanalyzed rumen metagenomes 221 from high-and low-methane emitting (HME and LME) sheep [41]. Importantly, these data were not 222 used for the database construction of mOTUs3. 223 Based on mOTUs3 taxonomic profiles, we identified 131 microbial species that differed significantly 224 in abundance between HME and LME samples and showed an at least tenfold increase or decrease in 225 relative abundance (corresponding to a generalized fold change of >= 1 [42]). Among these expected to be detectable by reference-based profilers. To test this, we applied the same workflow 228 using MetaPhlAn3 and Bracken (see Methods), which yielded only 10 and 30 differentially abundant 229 species for the respective tools ( Figure 5a). 230 Given the metabolic importance of methanogenic archaea in ruminants as well as previous evidence 231 of uncharted archaeal diversity in the sheep rumen [12], we further investigated the species-level 232 diversity of known and unknown archaeal species. To this end, we reconstructed a phylogenetic tree 233 of the archaeal mOTUs detected in the sheep rumen metagenomes (n=15) and contextualized them 234 with reference genomes from members of the genera Methanobrevibacter and Methanosphaera 235 ( Figure 5b). This analysis revealed that all six differentially abundant archaea in the sheep rumen 236 corresponded to ext-mOTUs. Two of them, which were significantly more abundant in high-methane 237 emitters, were most closely related to Methanobrevibacter gottschalkii, which itself was not detected. 238 Notably, the MG sequence similarity between these ext-mOTUs and M. gottschalkii was <85% 239 With mOTUs3, we have developed a taxonomic profiler that combines state-of-the-art accuracy, as 244 demonstrated in competitive benchmarks based on simulated datasets, with an innovative database 245 construction approach to detect and quantify underrepresented microbes from diverse environments at 246 high (i.e., species-level) taxonomic resolution. The ability to incorporate MG sequences from any 247 MAG and SAG to generate mOTUs de novo and independently from the availability of RefGs and/or 248 prior existence of taxonomic annotations (such as NCBI or GTDB species names) will allow users to 249 continuously extend the core database of mOTUs to represent microbial diversity from newly 250 explored microbiomes. Such future extensions could also target eukaryotic microorganisms, as these 251 are an integral part of many microbial communities, but are not well represented in databases of 252 existing taxonomic profiling tools. 253 However, the flexibility in defining operational taxonomic units de novo comes with a need for 254 taxonomic annotation, as is also the case for 16S rRNA-based de novo clustered OTUs. Despite the 255 calibration of MG sequence identity cutoffs to maximize congruence with the NCBI taxonomy [16], 256 this procedure can lead to conflicts with existing taxonomies. Irrespective of the ongoing debate on 257 whether prokaryotic species should be consistent with genomic similarity-based criteria, delineating 258 species by sequence identity puts mOTUs at a disadvantage in benchmarks, such as CAMI, which 259 rely on rigid matching of taxonomic labels. The high performance of mOTUs [34] despite this 260 disadvantage is likely due to the higher number of quantified taxa and the resulting reduction in 261 compositionality-related biases. 262 263

Conclusions 264
The present work introduces mOTUs3 as a reference-genome independent tool that allows for 265 charting the taxonomic landscape of many environments at species-level resolution. Its independence 266 from taxonomically annotated reference genomes, makes it generally applicable also beyond well-267 studied environments to quantify and reveal yet uncharacterized microbial species of potential 268 biological relevance. To support the research community, mOTUs3 is documented and available as 269 open source software at https://github.com/motu-tool/mOTUs. 270 271

Methods 272
Collection and processing of data to compile the mOTUs3 database 273 To extend the taxonomic coverage of the mOTUs3 database, 4,531 publicly available metagenomic 274 datasets from 23 environments (Supplementary Table 1) were processed to generate 150,880 MAGs 275 as previously described [43]. Briefly, BBMap (v.38.71) was used to quality control sequencing reads 276 from all samples by removing adapters from the reads, removing reads that mapped to quality control 277 sequences (PhiX genome) and discarding low-quality reads (trimq=14, maq=20, maxns=1 and 278 minlength=45). For metagenomic data of human origin, human genome-derived reads were removed 279 using the masked human reference genome provided by BBMap. Quality-controlled reads were 280 merged using bbmerge.sh with a minimum overlap of 16 bases, resulting in merged, unmerged paired and single reads. The reads were assembled into scaffolded contigs (hereafter scaffolds) using the 282 SPAdes assembler (v3.14 or v3. 12) [44] in metagenomic mode. Genes were predicted on length-283 filtered (≥ 500 bp) scaffolded contigs (hereafter scaffolds) using Prodigal (v2.6.3) [45]. Universal 284 single-copy phylogenetic marker genes (MGs) were extracted using fetchMGs (v1.2; -m extraction) 285 [16]. 286 Scaffolds were length-filtered (≥ 1000 bp) and within each study, quality-controlled reads from each 287 sample were mapped against the scaffolds of each sample. Mapping was performed using BWA 288      The evaluation was performed using the OPAL tool [1] on 63 simulated mouse gut metagenomes [2], which also provided taxonomic profiles for seven different taxonomic profiling tools, and to which we have added mOTUs3 profiling results. The OPAL tool ranks the tools for each sample and for each taxonomic level. The measures considered are completeness, purity, L1 norm error and weighted UniFrac error, shown individually in the bottom 4 plots. Tools with a lower score perform better, as the OPAL score is a sum over rank. The top plot represents the OPAL sum of scores, which is the sum over the four individual measures. mOTUs3 scored best in all categories, including the OPAL sum of scores.

Supplementary Table Legends
Supplementary Data from 91 studies from 23 environments were included in the extension and/or profiling of the mOTUs database. Of these, 39 studies were selected for in-house MAG reconstruction and 11,164 sequencing samples from 67 studies were used for taxonomic profiling.

Supplementary Table 3: Breakdown of taxonomic novelty in ext-mOTUs.
Taxonomic novelty increases with higher ranks, i.e., more than 50% of ext-mOTUs were assigned to previously unknown families.