Evaluating performance of metagenomic characterization algorithms using in silico datasets generated with FASTQSim

Background
In silico bacterial, viral, and human truth datasets were generated to evaluate available metagenomics algorithms. Sequenced datasets include background organisms, creating ambiguity in the true source organism for each read. Bacterial and viral datasets were created with even and staggered coverage to evaluate the organism identification, read mapping, and gene identification capabilities of available algorithms. These truth datasets are provided as a resource for the development and refinement of metagenomic algorithms. Algorithm performance on these truth datasets can inform decision makers about the strengths and weaknesses of available algorithms and how the results may be best leveraged for bacterial and viral organism identification and characterization.

Source organisms were selected to mirror communities described in the Human Microbiome Project as well as the emerging pathogens listed by the National Institute of Allergy and Infectious Diseases. The six in silico datasets were used to evaluate the performance of six leading metagenomics algorithms: MetaScope, Kraken, LMAT, MetaPhlAn, MetaCV, and MetaPhyler.

Results
Algorithms were evaluated on runtime, true positive organisms identified at the genus and species levels, false positive organisms identified at the genus and species levels, read mapping, relative abundance estimation, and gene calling. No algorithm outperformed the others in all categories, and the algorithm or algorithms of choice depend strongly on analysis goals. MetaPhlAn excels for bacteria and LMAT for viruses. The algorithms were ranked by overall performance using a normalized weighted sum of the above metrics, and MetaScope emerged as the overall winner, followed by Kraken and LMAT.

Conclusions
Simulated FASTQ datasets with well-characterized truth data about microbial community composition reveal numerous insights about the relative strengths and weaknesses of the metagenomics algorithms evaluated.
The simulated datasets are available to download from the Sequence Read Archive (SRP062063).

Kraken uses exact alignment of k-mers in combination with an optimized database and another version of the least common ancestors algorithm. MetaPhlAn [9] relies on unique clade-specific marker genes identified from 3,000 reference genomes. The Livermore Metagenomic Analysis Toolkit (LMAT) exploits genetic relationships between different organisms by pre-computing the occurrence of each short sequence across the entire reference database and storing the evolutionarily conserved sequence patterns [10-12]. MetaCV translates nucleotide sequences into six-frame peptides, which are then decomposed into k-mers. The k-mer frequency is computed in a protein reference database and used to assign k-mer weights [13]. Finally, MetaPhyler uses a precomputed database of reference phylogenetic marker genes to build a sequence classifier.

Sequence simulation can aid in answering questions about coverage requirements, necessary sequence length, and whether paired-end or single-end sequencing should be used. For example, the ART simulator was successfully used by the 1000 Genomes Project Consortium to examine the effects of read length and paired-end insert size on a read's ability to map to the human genome [19].

In this study, six in silico datasets were simulated with the FASTQsim tool. Figure S1 illustrates the composition of each dataset. These datasets contained sequences from reference bacterial and viral genomes, as most human pathogens are members of these taxa. The HMP Even and HMP Staggered datasets were generated to include sequences from the 20 organisms from the Human Microbiome Project [20] (Supplementary Table 1).
The HMP organisms were selected for inclusion after an attempt to benchmark the performance of MetaScope with the HMP dataset revealed potential contamination in the dataset. As the HMP benchmark dataset was generated by sequencing organisms cultured in vitro, there was no absolute truth for any background contaminant organisms in the dataset, and it was not possible to determine whether the contamination was real or whether MetaScope was calling false positive organisms.

The bacterial dataset (Supplementary … the Escherichia genus was added to the list due to the high abundance of representative sequences in GenBank [22].

Two virus datasets were generated with 21 species across 11 representative genera (Supplementary …).

Improvements to FASTQsim

The FASTQsim toolkit was augmented to annotate gene information for simulated reads [27]. The "FASTQmapGenes" functionality was added, allowing users to specify NCBI accession IDs to use for annotating gene information in simulated reads. The FASTQsim toolkit uses the Entrez and SeqIO libraries from BioPython [28] to download the specified files from GenBank in .gb format. The GenbankParser [29] Java application is then used to parse the .gb files and extract all information encoded in the CDS and Gene tags. These gene and CDS annotations are appended to the headers within the simulated FASTQ files generated by FASTQsim, such that all reads that fall within a CDS or gene region are annotated with the corresponding CDS and gene information.
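The per-read annotation step described above reduces to an interval-overlap check between each simulated read's placement on the reference and the extracted gene/CDS intervals. The following is a minimal sketch of that idea, not FASTQsim's actual implementation; the header format, tag syntax, and coordinates are illustrative assumptions:

```python
# Illustrative sketch of header annotation: reads carry their reference
# placement, and gene/CDS intervals are assumed to have already been
# extracted from the .gb files (here supplied directly as tuples).

def annotate_headers(reads, features):
    """reads: list of (header, start, end) read placements on the reference.
    features: list of (kind, name, start, end) intervals, 0-based, end-exclusive.
    Returns headers with tags for every overlapping gene/CDS appended."""
    annotated = []
    for header, r_start, r_end in reads:
        tags = [f"{kind}:{name}"
                for kind, name, f_start, f_end in features
                if r_start < f_end and f_start < r_end]  # half-open overlap test
        annotated.append(header + ("|" + ";".join(tags) if tags else ""))
    return annotated

features = [("gene", "dnaA", 0, 1400), ("CDS", "dnaA_cds", 100, 1300)]
reads = [("@read1 pos=50-200", 50, 200),    # falls inside dnaA -> gets tags
         ("@read2 pos=2000-2150", 2000, 2150)]  # outside all features
print(annotate_headers(reads, features))
```

A read overlapping several features receives one tag per feature, mirroring the paper's statement that reads falling in both a gene and a CDS region are annotated with both.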

In silico data generation

The FASTQsim toolkit was used to generate six in silico datasets. All were generated with the Illumina error and read length profile included with FASTQsim version 2.0, with no host background added. Specifically, a read length of 150 bases was used, with single-base mutation, insertion, and deletion rates as specified in the FASTQsim v2.0 profile. Two in silico datasets were generated, "HMP Even" and "HMP Staggered" (Supplementary Table 1).
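The even and staggered coverage levels chosen for each organism translate into simulated read counts through the standard coverage relation C = N x L / G. A small sketch of that arithmetic follows; the 150-base read length matches the profile described above, while the genome length is an illustrative assumption:

```python
# Coverage relation used when sizing even vs. staggered datasets:
# C = N * L / G, so the read count for a target coverage is N = C * G / L.

READ_LENGTH = 150  # bases, matching the FASTQsim Illumina profile used here

def reads_for_coverage(coverage, genome_length, read_length=READ_LENGTH):
    """Number of single-end reads needed to reach a target average coverage."""
    return round(coverage * genome_length / read_length)

# e.g. a hypothetical 2.4 Mb bacterial genome at 10x average coverage
print(reads_for_coverage(10, 2_400_000))
```

In an even dataset every organism gets the same target coverage; in a staggered dataset the coverage argument varies per organism, so low-abundance members receive proportionally fewer reads.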

Runtime in seconds, true positive genus and species identification, false positive genus and species identification, and false negative species calls were determined for each of the metagenomic algorithms (Figure 1). Among the algorithms evaluated, only MetaScope mapped a small number of reads in our datasets to a taxon rank below species. Consequently, although the initial focus of the Bacterial dataset was to assess the ability of the algorithms to distinguish between different strains of the same species, it was decided to evaluate both true and false positives at the species and genus levels. To determine an overall rank of the algorithms across the datasets, the area occupied by each in the radar plot was computed (Table 1). When the polygon area was calculated using the MATLAB polyarea function and summed across all datasets, MetaScope emerged as the winner, with the largest overall area. Kraken and LMAT were the runners-up, and MetaPhyler performed the worst. In addition to the algorithms' overall rank, several trends can be noted in the individual performance categories.

The algorithms diverged in runtime by several orders of magnitude. In addition to its high speed, MetaPhlAn also achieved the highest accuracy, defined as the ratio of true … with fewer than 100 reads assigned to each. Kraken has a similar false positive profile; it reports 1,266 species that account for <1% of the reads in the dataset. MetaCV reports 2,998 false organisms with low read counts, and LMAT reports 1,118 species that account for less than 0.01% of the reads. MetaPhyler does not report results more specific than the Class taxonomy level for the Human dataset, in line with the conservative approach of this algorithm. MetaPhlAn crashes with a segmentation fault on the Human dataset, which is most likely an artifact of the non-host-filtering approach used by this algorithm.
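The radar-plot ranking can be reproduced with the shoelace formula, which computes the same quantity as MATLAB's polyarea: each algorithm's normalized per-category scores are placed at equal angles around the origin and the enclosed polygon area is summed. The metric values below are made up for illustration; the actual ranking used the per-category results summarized in Figure 1 and Table 1:

```python
import math

def radar_area(scores):
    """Area of the radar-plot polygon for normalized scores in [0, 1],
    placed at equal angles around the origin (shoelace formula)."""
    n = len(scores)
    pts = [(s * math.cos(2 * math.pi * i / n), s * math.sin(2 * math.pi * i / n))
           for i, s in enumerate(scores)]
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1])))

# Hypothetical normalized scores for four performance categories
algorithms = {"A": [0.9, 0.8, 0.7, 0.9], "B": [0.5, 0.6, 0.4, 0.5]}
ranking = sorted(algorithms, key=lambda a: radar_area(algorithms[a]), reverse=True)
print(ranking)  # largest total area ranks first
```

Summing this area across all six datasets, as described above, yields a single scalar per algorithm that rewards consistently high scores across every category.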
The algorithms were also evaluated on their ability to correctly map reads and predict the relative abundance of the organisms in the data (Figures 2, 3). For the bacterial datasets, Kraken and MetaScope classified the highest number of reads correctly at both the genus and species levels, and cluster closest to the truth in the dendrogram. However, for the viral datasets, LMAT performed best, classifying the most reads correctly.

Although the Actinomyces odontolyticus (NZ_DS264586.1) organism had the highest coverage (11.3x, 217,512 reads) in the HMP Staggered dataset, the algorithms on the whole did not perform well on this organism. It was not identified by the Kraken, MetaCV, and MetaPhyler algorithms, and was called at a low level by MetaScope (153 reads) (Figure 2g). MetaCV mapped the most reads correctly, 108,211 (49.7%), and MetaPhlAn was second best, identifying 22,647 (10.4%) of the reads. None of the algorithms identified any of the 2,045 A. odontolyticus genes (Figure 5b). This poor performance likely results from the fact that the A. odontolyticus genome annotation in GenBank is incomplete [36]. Conversely, at the species level, five of the six algorithms mapped a high number of reads to Streptococcus agalactiae for both the HMP Even and HMP Staggered datasets (Figures 2f, 2g), but only a small number of reads for this organism were present in the truth data. The relative abundance of Streptococcus mutans is lower in the algorithm calls as compared to truth, while the relative abundance of Streptococcus agalactiae is higher, suggesting that a number of the reads called for S. agalactiae are actually from S. mutans (Figures 3b, 3d). This implies difficulty distinguishing between closely related species. Similarly, a high number of reads are assigned correctly to the Yersinia and Escherichia genera by Kraken and MetaScope (Figure 2c).
However, the algorithms under-assign reads for Escherichia albertii and over-assign reads for Yersinia pseudotuberculosis, which indicates difficulty in distinguishing between these species (Figure 2h).

Overall, the algorithms were equally able to identify organisms in the staggered datasets as in the even datasets, suggesting that accurate read mapping depends more on the database supplied to the algorithm than on the abundance of the organism in the dataset. Additionally, for the bacterial datasets, Kraken, MetaScope, LMAT, and MetaPhlAn generally agreed on read mapping assignments. However, for the viral datasets, the algorithms missed different sets of organisms (e.g., in Figure 3i …).

The algorithms were also evaluated based on false positive hits (Figure 4). MetaCV and LMAT have diverse error profiles: small numbers of reads are mapped to a high number of false positive organisms. Our past experience with the MetaScope algorithm suggests that this false positive profile indicates an algorithm has difficulty classifying organisms that are not present in the reference database. Ideally, when an algorithm encounters a novel organism, it should regress up the taxonomic tree until a nearest neighbor for the unknown organism can be established. However, the algorithm may instead report all reference organisms that match the unknown sample to a certain threshold. In contrast, Kraken has a highly concentrated error profile: fewer than 20 false positive organisms are reported, but several thousand reads are mapped to each of them, suggesting high-confidence calls.

Finally, the gene calling capabilities of the algorithms were evaluated (Figure 5) …

Competing interests
The authors declare that they have no competing interests.

Ethics approval
Ethics approval was not required for this study because all data was generated in silico using references available in GenBank, as indicated in Supplementary Tables 1-3.