Abstract
Plasmids are extra-chromosomal genetic elements, commonly found in bacterial cells, that support many functions, including environmental adaptation. Identifying these genetic elements is vital for further study of the function and behaviour of the organisms. However, it is challenging to separate these small sequences from the longer chromosomes within a given species. Machine learning approaches have been successfully developed to classify assembled contigs into two classes (plasmids and chromosomes). However, such tools are not designed to directly classify long and error-prone reads, which have been gaining popularity in genomics studies. Assembling complete plasmids also remains challenging for many long-read assemblers when the input mixes long and error-prone reads from plasmids and chromosomes. In this paper, we present PlasLR, a tool that adapts existing plasmid detection approaches to directly classify long and error-prone reads. PlasLR makes use of both the composition and the coverage information of these reads. We evaluate PlasLR on multiple simulated and real long-read datasets with varying compositions of plasmids and chromosomes. Our experiments demonstrate that PlasLR, when applied on top of state-of-the-art plasmid detection tools, substantially improves the accuracy of plasmid detection. Moreover, we show that using PlasLR before long-read assembly enhances assembly quality in terms of plasmid recovery and near-complete chromosome assembly from metagenomic datasets.
Competing Interest Statement
The authors have declared no competing interest.