Abstract
The linear human reference genome in use today does not represent the haplotypic diversity of the global human population. This introduces bias in genomic read alignment and limits our ability to call large structural variants (SVs), especially at highly polymorphic loci, so many SV alleles remain unresolved. Recent efforts to transition to a graph-based reference genome have produced the first draft human pangenome reference, but tools to call SVs relative to the pangenome reference are presently lacking. In this study, we present SVarp, an algorithm that discovers haplotype-resolved SVs on top of a pangenome reference using long sequencing reads. Instead of a VCF file of SV breakpoints, SVarp outputs local assemblies of SV alleles, termed svtigs, which we propose as a general exchange format that allows flexible downstream analyses. We assessed the accuracy of svtigs using both simulated and real human genomes. Simulations allowed us to make exact breakpoint comparisons against the true callsets: we observed ∼96% recall for deletions, insertions and duplications larger than 1,000 bp, showing that SVarp can reliably detect genomic structural variants not yet represented in the graph. For real data, we compared SVarp output for ONT sequencing data at 20× coverage against independent genome assemblies of the same samples and found that ∼82% of our svtig predictions are validated by the assemblies, each matching with more than 85% sequence identity. SVarp is implemented in C++ and its source code is available at https://github.com/asylvz/SVarp under the MIT license.
Competing Interest Statement
The authors have declared no competing interest.