Abstract
Efficient detection of genomic structural variants (SVs) is a key step toward explaining the “missing heritability” underlying complex traits involved in major evolutionary processes such as speciation, phenotypic plasticity, and adaptive responses. Yet SV-based genotype/trait association studies remain largely overlooked, mainly because reliable detection methods are lacking. Here, we present RF4Del, a random forest (RF) method for accurate deletion identification. By relying on mapping profiles, data already available in most sequencing projects, RF4Del can call deletions easily and quickly.
Several classic and ensemble learning strategies were carefully evaluated on benchmark data. RF4Del was trained and tested on simulated data from the model species Drosophila melanogaster to detect deletions. The model uses 13 features extracted from a mapping file. We show that RF4Del outperforms established SV callers (DELLY, Pindel), achieving higher overall performance (F1-score > 0.75 at 6x-12x sequence coverage), and is less affected by low sequencing coverage and variation in deletion size. RF4Del could learn from a compilation of sequence patterns linked to a given SV. Such models can then be combined to form a learning system able to detect all types of SVs in a given genome, beyond the one used in our study. https://github.com/alvesrcoo/eletric-scheep
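To make the F1-score reported above concrete, the following minimal sketch shows one common way such a score could be computed for a set of deletion calls against a simulated truth set. It is an illustration under stated assumptions, not the evaluation procedure used for RF4Del: the function names, the half-open interval representation, and the 50% reciprocal-overlap matching rule are all hypothetical choices made for this example, and the intervals are invented toy data.

```python
from typing import List, Tuple

# Assumption: a deletion is a half-open interval (start, end) on one chromosome.
Interval = Tuple[int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions between a and b."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))


def f1_for_calls(truth: List[Interval], calls: List[Interval],
                 min_overlap: float = 0.5) -> Tuple[float, float, float]:
    """Compute (precision, recall, F1) for deletion calls.

    Assumption: a call is a true positive if it reaches `min_overlap`
    reciprocal overlap with a true deletion not already matched.
    """
    matched = set()  # indices of true deletions already claimed by a call
    tp = 0
    for call in calls:
        for i, true_del in enumerate(truth):
            if i not in matched and reciprocal_overlap(call, true_del) >= min_overlap:
                matched.add(i)
                tp += 1
                break
    fp = len(calls) - tp   # calls matching no true deletion
    fn = len(truth) - tp   # true deletions that were missed
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (invented): three simulated deletions, four calls.
truth = [(100, 200), (500, 800), (1000, 1100)]
calls = [(110, 205), (520, 790), (2000, 2100), (3000, 3050)]
precision, recall, f1 = f1_for_calls(truth, calls)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case two calls match true deletions, two are false positives, and one true deletion is missed, so F1 equals 2TP / (2TP + FP + FN) = 4/7. A stricter or looser overlap threshold would change which calls count as matches and therefore the score.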
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Roberto Xavier: rbxjunior{at}gmail.com, Anna-Sophie Fiston-Lavier: anna-sophie.fiston-lavier{at}umontpellier.fr, Ronnie C.O. Alves: ronnie.alves{at}itv.org, Emira Cherif: emira.cherif{at}ird.fr.