Benchmarking metagenomic classification tools for long-read sequencing data

Josip Marić; Krešimir Križanović; Sylvain Riondet; Niranjan Nagarajan; Mile Šikić

doi:10.1101/2020.11.25.397729

Abstract

We performed a comprehensive assessment of metagenomic classification tools on long-read sequencing data. In addition to well-defined mock communities, we prepared various synthetic datasets to simulate real-life scenarios. The results show that off-the-shelf mappers such as Minimap2 or Ram are at least comparable with mapping-based classification tools in most accuracy measures, while not being much slower than k-mer-based tools and requiring equal or less RAM. The majority of the tested tools are prone to reporting organisms that are not present in the datasets, and they underperform when a large share of the sample is the host's genetic material. Furthermore, longer reads make classification easier, but because read length distributions differ among species, using only the longest reads reduces accuracy. Finally, evaluation on a mock community shows the importance of careful isolation of genetic material and sequencing preparation.

Availability and implementation

Python scripts used to generate all figures and tables in this study, along with all supplementary texts and figures, are available in the GitHub repository https://github.com/lbcb-sci/MetagenomicsBenchmark. Datasets, supporting files, analysis results and reports are available in the Zenodo repository https://doi.org/10.5281/zenodo.5203182.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

The initial paper was expanded with additional datasets, the CLARK-S tool, testing of mapping tools and different analyses of the obtained data.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.