Abstract
Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community. While species-level classification has become routine, correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem. We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora accurately re-assigns reads by first estimating abundances through an expectation-maximization algorithm and then using that abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, which prevents Mora from over-assigning reads to the same genomes. On simulated data, this allows Mora to achieve F1 scores of > 74% when assigning reads generated from three distinct E. coli strains, more than double the F1 scores achieved by Pathoscope2, Pufferfish, CLARK, and Bowtie2. Furthermore, we show that the high penalty for over-assigning reads to a common reference genome allows Mora to accurately identify the presence of low-abundance strains and species.
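The two-stage idea described above (estimate abundances with expectation-maximization, then re-assign reads while staying close to those abundances) can be illustrated with a toy sketch. This is not Mora's implementation: the function names `estimate_abundances` and `reassign`, the per-read score dictionaries, and the greedy capacity rule are all assumptions made for illustration, standing in for the actual objective, which the abstract does not specify in detail.

```python
from collections import Counter


def estimate_abundances(scores, n_iter=50):
    """Toy EM. `scores` is a list with one dict per read that maps a
    genome name to a positive alignment likelihood (assumed input format)."""
    genomes = sorted({g for s in scores for g in s})
    theta = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        totals = dict.fromkeys(genomes, 0.0)
        for s in scores:
            # E-step: share each read among its candidate genomes.
            weights = {g: theta[g] * q for g, q in s.items()}
            z = sum(weights.values())
            if z == 0:
                continue
            for g, w in weights.items():
                totals[g] += w / z
        # M-step: abundance is the expected fraction of reads per genome.
        theta = {g: totals[g] / len(scores) for g in genomes}
    return theta


def reassign(scores, theta):
    """Greedy stand-in for abundance-aware re-assignment: each genome may
    receive about theta[g] * n reads, and unambiguous reads go first."""
    n = len(scores)
    capacity = {g: round(theta[g] * n) for g in theta}

    def margin(s):
        ranked = sorted(s.values(), reverse=True)
        return ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)

    order = sorted(range(n), key=lambda i: -margin(scores[i]))
    labels = [None] * n
    for i in order:
        s = scores[i]
        open_genomes = [g for g in sorted(s) if capacity.get(g, 0) > 0]
        # If every candidate is already full, fall back to all candidates.
        candidates = open_genomes or sorted(s)
        best = max(candidates, key=lambda g: (s[g], capacity.get(g, 0)))
        labels[i] = best
        capacity[best] = capacity.get(best, 0) - 1
    return labels


# Six reads unique to strain A, two unique to B, four equally ambiguous.
scores = [{"A": 1.0}] * 6 + [{"B": 1.0}] * 2 + [{"A": 0.5, "B": 0.5}] * 4
theta = estimate_abundances(scores)
print(Counter(reassign(scores, theta)))
```

In this toy input the estimated abundances settle near 0.75 for A and 0.25 for B, so the four ambiguous reads are split three to A and one to B rather than all going to the same genome, which is the over-assignment behaviour the abstract says Mora penalizes.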
Code availability https://github.com/AfZheng126/MORA
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
andrewf.zheng{at}mail.utoronto.ca, jshaw{at}math.toronto.edu, ywyu{at}math.toronto.edu
⋆ Supported by Natural Sciences and Engineering Research Council of Canada (NSERC) grant RGPIN-2022-03074.
Results for the taxonomic classifiers Kraken2 and CLARK and for the initial aligners Bowtie2, Pufferfish, and Minimap2 have been added to give a more complete picture of how Mora compares to other algorithms. There are also slight edits that expand on ideas discussed in the paper.