Abstract
Motivation Advances in cloning and sequencing technology yielded a massive number of genome of virus strains. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for a well-studied virus. Thus, the viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.
Results Here, we introduce a fast, accurate and generic virus classification platform, CASTOR, based on a machine learning approach. CASTOR is inspired by a well-known technique in molecular biology: Restriction Fragment Length Polymorphism (RFLP). It simulates the restriction digestion of genomic material by different enzymes into fragments in-silico. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of Human Papillomaviruses (HPV), Hepatitis B Viruses (HBV) and Human Immunodeficiency viruses (HIV). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV M group subtyping respectively. Furthermore, CASTOR shows a competitive performance compare to well-known HIV-specific classifier REGA and COMET on whole genome and pol fragments. With such prediction rates, genericity and robustness, as well as rapidity, such approach could constitute a reference in large-scale virus studies. Finally, we developed the CASTOR web platform for open access and reproducible viral machine learning classifiers.
Availability http://castor.bioinfo.uqam.ca
Contact diallo.abdoulaye{at}uqam.ca