Evaluating Metagenome Assembly on a Simple Defined Community with Many Strain Variants

We evaluate the performance of three metagenome assemblers, IDBA, MetaSPAdes, and MEGAHIT, on short-read sequencing of a defined “mock” community containing 64 genomes (Shakya et al. (2013)). We update the reference metagenome for this mock community and detect several additional genomes in the read data set. We show that strain confusion results in significant loss in assembly of reference genomes that are otherwise completely present in the read data set. In agreement with previous studies, we find that MEGAHIT performs best computationally; we also show that MEGAHIT tends to recover larger portions of the strain variants than the other assemblers.


Introduction
using the script trim-low-abund.py script with -C 5 from khmer v2 [27, 140 28]. 141 Assemblers 142 We assembled the quality-filtered reads using three different assemblers: [17], we used --pre correction to perform pre-correction before assembly 145 and -r for the pe files. IDBA could not ingest orphan sequences so singleton 146 reads were omitted from this assembly.  where -l is for maximum read length, -m is for max memory in bytes to 153 be used in constructing the graph, and --cpu-only uses only the CPU 154 and no GPUs. We also used --presets meta-large for large and complex 155 metagenomes, and --12 and -r to specify the interleaved-paired-end and 156 single-end files respectively. MEGAHIT allows the specification of a memory limit and we used -M 1e+10 for 10 GB. 158 All three assemblies were executed on the same XSEDE Jetstream in-159 stance (S1.Xxlarge) at Indiana University, running Ubuntu 16.04 (install  Unless otherwise mentioned, we eliminated all contigs less than 500 bp 167 from each assembly prior to further analysis. 168 Mapping 169 We aligned all quality-filtered reads to the reference metagenome with bwa 170 aln (v0.7.7.r441) [23]. We aligned paired-end and orphaned reads separately. 171 We then used samtools (v0.1.19) [29] to convert SAM files to BAM files for 172 both paired-end and orphaned reads. To count the unaligned reads, we 173 included only those records with the "4" flag in the SAM files [29]. 174 Assembly analysis using NUCmer 175 We used the NUCmer tool from MUMmer3.23 [30] to align assemblies to the 176 reference genome with options -coords -p. Then we parsed the generated 177 ".coords" file using a custom script analyze assembly.py, and calculated 178 several analysis metrics across all three assemblies at a 99% alignment iden-179 tity.

180
Reference-based analysis of the assemblies 181 We conducted reference-based analysis of the assemblies under two condi-  The script summarize-coords2.py was used to calculate aligned cov-187 erage from the loose alignment conditions: each base in the reference was 188 marked as "covered" if it was included in at least one alignment. The script 189 analyze ng50.py was used to calculate NGA 50 for each individual reference genome.

191
Analysis of chimeric misassemblies 192 We analyzed each assembly for chimeric misassemblies by counting the num-193 ber of contigs that contained matches to two distinct reference genomes. In 194 order to remove secondary alignments from consideration, we included only 195 the longest non-overlapping NUCmer alignments for each contig at a mini-196 mum alignment identity of 99%. We then used the script analyze chimeric2.py 197 to find individual contigs that matched more than one distinct reference 198 genome. As a negative control on our analysis, we verified that this ap-199 proach yielded no positive results when applied to the alignments of the 200 reference metagenome against itself.

201
Analysis of unmapped reads 202 We conducted assembly and analysis of unmapped reads with MEGAHIT, 203 NUCmer, and sourmash as above. The new GenBank genomes are listed in 204 the Zenodo archive at the file accession-list-unmapped.txt and for con-205 venience are available for download at the archival URL https://osf.io/34ef8/.

207
The raw data is high quality.     present at some minimal coverage. 245 We excluded the remaining 13 genomes (see Table 3) from any fur-246 ther reference-based analysis because interpreting recovery and misassembly 247 statistics for these genomes would be confounding; also see the discussion of 248 strain variants, below.

249
MEGAHIT is the fastest and lowest-memory assembler eval-250 uated 251 We ran three commonly used metagenome assemblers on the QC data set:   The assemblies contain most of the raw data 261 We assessed read inclusion in assemblies by mapping the QC reads to 262 the length-filtered assemblies and counting the remaining unmapped reads.

263
Depending on the assembly, between 2.7 million and 3.9 million reads (2.5-264 3.5%) did not map to the assemblies ( Table 5). All of the assemblies included 265 the large majority of high-abundance 51-mers (more than 96.8% in all cases).

266
Much of the reference is covered by the assemblies.   A '*' after the name indicates the presence of at least one other genome with > 2% Jaccard similarity at k=31 in the community. Where NGA50 cannot be calculated due to poor coverage, a marker is placed at 1kb.
Individual genome statistics vary widely in the assemblies. 283 We computed the NGA50 for each individual genome and assembly in order 284 to compare assembler performance on genome recovery (see left panel of Fig-285 ure 2). The NGA50 statistics for individual genomes vary widely, but there 286 are consistent assembler-specific trends: IDBA yields the lowest NGA50 for 287 28 of the 51 genomes, while MetaSPAdes yields the highest NGA50 for 32 288 of the 51 genomes. 289 We also evaluated aligned coverage per genome for each of the three 290 assemblies (right panel, Figure 2). We found that 13 of the 51 genomes were 291 missing 5% or more of bases in at least one assembly, despite all 51 genomes 292 having 99% or higher read-and 51-mer coverage.

293
There are 12 genomes with k=31 Jaccard similarity greater than 2% 294 to other genomes in the community, and these (denoted by '*' after the 295 name) typically had lower NGA50 and aligned coverage numbers than other 296 genomes. In particular, these constituted 12 of the 13 genomes missing 5% 297 or more of their content, and the lowest eight NGA50 numbers.

298
Longer contigs are less likely to be chimeric.  and assembled these reads in isolation using MEGAHIT, yielding 6.5 Mbp 311 of assembly in 1711 contigs > 500bp in length. We then did a k-mer in-312 clusion analysis of this assembly against all of the GenBank genomes at 313 k=31, and estimated the fraction of the k-mers that belonged to different 314 species (Table 9). We find that 51.1% of the k-mer content of these contigs 315 positively match to a genome present in GenBank but not in the reference 316 metagenome.

317
To verify these assignments, we aligned the MEGAHIT assembly of un-318 mapped reads to the GenBank genomes in Table 9 with NUCmer using 319 "loose" alignment criteria. We found that 1.78 Mbp of the contigs aligned 320 at 99% identity or better to these GenBank genomes. We also confirmed 321 that, as expected, there are no matches in this assembly to the full updated 322 reference metagenome. 323 We note that all but the two P. ruminis matches and the E. coli isolate 324 YS are strain variants of species that are part of the defined community 325 but are not completely present in the reads (see Table 2). For Proteiniclas-326 ticum ruminis, there is no closely related species in the mock community 327 design, and very little of the MEGAHIT assembly aligns to known P. ru-328 minis genomes at 99%. However, there are many alignments to P. ruminis Assembly recovers basic content sensitively and accurately.

336
All three assemblers performed well in assembling contigs from the con-337 tent that was fully present in reads and k-mers. After length filtering, 338 all three assemblies contained more than 95% of the reference (Table 6); 339 even with removal of secondary alignments, more than 87% was recovered 340 by each assembler (Table 7). About half the constituent genomes had an 341 NGA50 of 50kb or higher (Figure 2), which, while low for current Illumina 342 single-genome sequencing, is sufficient to recover operon-level relationships 343 for many genes.

344
The presence of multiple closely related genomes confounds 345 assembly.

346
In agreement with CAMI, we also find that the presence of closely related that one or more assembler will fail to recover 5% or more -of the 13/51 354 genomes for which less than 95% is recovered, 12 of them have close genomes 355 in the community. Interestingly, very little similarity is needed -all genomes 356 with Jaccard similarity of 2% or higher at k=31 exhibit these problems.

357
The Shewanella baltica OS185 genome is a good example: there are two 358 strain variants, OS185 and OS223, present in the defined community. Both 359 are present at more than 99% in the reads, and more than 98% in 51-mers, 360 but only 75% of S. baltica OS185 and 50% of S. baltica OS223 are recovered 361 by assemblers. This is a clear case of "strain confusion" where the assemblers 362 simply fail to output contigs for a substantial portion of the two genomes.

363
Another interest of this study was to examine cross-species chimeric as-364 sembly, in which a single contig is formed from multiple genomes. In Table 8, 365 we show that there is relatively little cross-species chimerism. Surprisingly, 366 what little is present is length-dependent: longer contigs are less likely to 367 be chimeric. This might well be due to the same "strain confusion" effect 368 as above, where contigs that share paths in the assembly graphs are broken 369 in twain.

370
MEGAHIT performs best by several metrics.
MetaSPAdes and IDBA in memory and time (Table 4). The MEGAHIT 373 assembly also included more of the reads than either IDBA or MetaSPAdes, 374 and omitted only 0.4% more of the unique 51-mers from the reads than 375 IDBA. MEGAHIT covered more of the reference genome with both loose 376 and strict alignments (Table 6 and Table 7 Figure 2 suggest that much of 396 this differential assembly content is due to the impact of strains.

397
The missing reference may be present in strain variants of the 398 intended species.

399
Several individual genomes are missing in measurable portion from the QC 400 reads (Table 2), and many QC reads (4.4% of 108m) did not map to the full 401 reference metagenome. These appear to be related issues: upon analysis of 402 the unmapped reads against GenBank, we find that many of the contigs as-403 sembled from the unmapped reads can be assigned to strain variants of the 404 species in the mock community (Table 9)   An additional concern is that metagenome assemblies are often performed after pooling data sets to increase coverage (e.g. [4,32]); this pooled 445 data is more likely to contain multiple strains, which would then in turn 446 adversely affect assembly of strains. This may not be resolvable within the 447 current paradigm of assembly, which focuses on outputting linear assem-448 blies that cannot properly represent strain variation.