ABSTRACT
The accurate characterization of structural variation is crucial for our understanding of how large chromosomal alterations affect phenotypic differences and contribute to genome evolution. Whole-genome sequencing is a popular approach for identifying structural variants, but the accuracy of commonly used tools remains unclear due to the limitations of existing benchmarks. Moreover, the performance of these tools for predicting variants in non-human genomes is less certain, as most tools were developed and benchmarked using data from the human genome.
To address this problem, multiple short- and long-read tools were benchmarked using real and simulated Caenorhabditis elegans whole-genome sequence data. To evaluate the use of long-read data for the validation of short-read predictions, the agreement between predictions from a short-read ensemble learning method and those from long-read tools was compared. The results obtained from simulated data indicate that the best performing tool is contingent on the type and size of the variant, as well as the sequencing depth of coverage. These results also highlight the need for reference datasets generated from real data that can be used as ‘ground truth’ in benchmarks.
INTRODUCTION
Large alterations in chromosome structure contribute substantially to the genetic diversity observed in natural populations and play a fundamental role in the evolution of novel genes [1]. These changes that span large segments of the genome (e.g., > 100 bp) are termed structural variants (SVs) and include deletions, tandem and interspersed duplications, insertions, and inversions. SVs may be neutral, deleterious, or adaptive [2], and are known to facilitate speciation [3]. SVs drive genome evolution through several mechanisms. For example, large heterozygous inversions can suppress recombination, thereby protecting locally adapted alleles [4]. Copy number variation (CNV), which describes the gain or loss of genes, is another important factor in genome evolution. CNVs are associated with a wide range of phenotypic effects due to the modulation of gene expression, including differential drug responses between individuals [5], HIV susceptibility [6], autism spectrum disorders [7], and schizophrenia [8]. Gene duplication and subsequent divergence are a source of novel genes and functional diversification [9–11]. Importantly, SVs have also been proposed as a source of the “missing heritability” seen in genome-wide association studies [12].
Several approaches have been developed for the detection of SVs from paired-end whole-genome sequencing (WGS) data [13,14]. Paired-end approaches detect SVs when the orientation of a mapped read-pair is inconsistent with the reference genome or when the alignment produces an unexpected insert size. Split-read approaches detect SVs by identifying individual reads that span a given variant, resulting in at least two partial alignments. CNVs may be detected using read-depth differences caused by gene loss or gain. Hybrid approaches that combine multiple signals are employed by many tools and, recently, ensemble methods that leverage multiple separate SV callers have been developed [15,16].
The emergence of international ‘sequence everything’ projects [17,18] and the continual reduction in sequencing costs will enable researchers to study the role that SV plays in the evolution of their favourite species. Although numerous tools are available, most benchmarks are limited to the human genome and a narrow range of sequencing depths [13,14,19,20]. Due to the difficulty in validating SVs, benchmarks often rely upon simulated data or incomplete sets of experimentally validated variants [21]. Long-read support has also been used to validate SV calls [22], but the accuracy of this method is unknown. Because genome properties, such as repeat content, GC content, or heterozygosity, can differ significantly between species, it is unclear how callers benchmarked on the human genome will perform for other species. At the time of writing, most curated, large, multi-population, whole-genome datasets for metazoan species are restricted to humans. A notable exception is the Caenorhabditis elegans Natural Diversity Resource (CeNDR), which houses whole-genome data for 1514 isolates and 548 strains of the free-living nematode C. elegans (August 2020) [23]. Because the C. elegans genome differs from the human genome in several key characteristics, evaluating SV calling in C. elegans would provide valuable points of comparison with past benchmarks. For example, although the GC content in C. elegans (36%) [24] does not differ considerably from humans (41%) [25], it is distributed uniformly among the chromosomes, whereas the GC content in the human genome varies considerably depending on the location. Furthermore, C. elegans reproduces primarily through self-fertilizing hermaphrodites, which can lead to profound genomic differences compared to dioecious species that reproduce solely through the mating of males and females. Notably, selfing results in increased homozygosity and reduced recombination, which have been predicted to impact transposon dynamics and the purging of deleterious alleles [26]. In terms of repeat content, the C. elegans genome contains substantially fewer transposons (∼12% of the genome) [27] and other repetitive sequences compared to the repeat-rich human genome (∼44% of the genome is derived from transposable elements) [28]. This is a key factor, as repeats are associated with structural variation [29,30] but are also a source of error in SV calling [14,19,31].
Here, the performance of SV calling in Caenorhabditis elegans was evaluated using real and simulated data. Multiple commonly used short- and long-read SV calling tools were benchmarked using mock genomes containing simulated variants. Real data were used to assess the degree of overlap between individual callers and to determine if the results from an ensemble short-read approach were comparable to those obtained by long-read callers. The benchmarks shown here demonstrate that SV prediction depends highly on the tool used and that optimal tool choice for each platform depends on the type and size of SV. These results provide valuable information for researchers studying structural variation in C. elegans or other species with similar genome properties.
RESULTS
Structural Variation Prediction Using Simulated Short-Read Data
To evaluate the performance of various short-read structural variant calling methods, mock genomes containing simulated deletions, duplications, and inversions were created, and used to generate simulated Illumina reads at 5X, 15X, 30X, and 60X sequencing depth of coverage. The caller performance varied by variant type and depth (Table 1).
Caller performance using simulated short-read data
Identification of known variants in C. elegans
For deletion calls, DELLY had the highest F-measure scores at all sequencing depths, followed closely by BreakDancer. The sequencing depth had a stronger impact on the other variant callers, except for Hydra. The accuracies for cnMOPS, CNVnator, and Tigra all improved considerably above 5X, while increased false positives accounted for the decreased accuracy of Lumpy at 60X. The performance of FusorSV was similar to that of the best performing tools for each variant type at all depths.
No single metric completely describes the performance of each variant caller. Because the precision, recall, and F1 scores were calculated based on the number of true positives, false positives, and false negatives, they fail to describe the performance of each caller at the base-pair level. The Jaccard similarity score was used to describe the base-pair overlap between the predicted variants and simulated variants. Because the Jaccard score is based on the proportion of intersecting base pairs, the metric is biased towards the performance for larger SVs. Although DELLY had the highest accuracy at all depths, its Jaccard value decreased from 0.99 at 30X depth to 0.74 at 60X depth (Fig 1). This decrease was caused by a 2.56 Mbp false positive at 60X depth. Differences between the Jaccard and F1 scores were observed for several callers. At 5X depth, the CNVnator F1 and Jaccard scores were 0.53 and 0.91, respectively (Supplemental_Table_S1.xls). Because CNVnator performed well at predicting larger variants, a higher Jaccard score was obtained despite low accuracy for smaller deletion sizes. Conversely, the Hydra F1 scores ranged between 0.72 and 0.79, while its Jaccard scores ranged between 0.12 and 0.13. The large discrepancies between the F1 and Jaccard scores resulted from good performance for smaller deletions but poor performance for larger sizes.
Fig 1. Accuracy of predicted deletions from simulated short-read data. Results shown for 60X depth.
For duplication calls, DELLY had the highest F-measure scores at all depths, followed closely by BreakDancer. The accuracies of CNVnator, Hydra, and Lumpy improved with increased depth between 5X and 30X, while only a slight decrease was observed for Hydra at 60X depth. The accuracy of Tigra was highest at 15X and decreased with further increases in depth. Increased depth decreased the cnMOPS accuracy due to higher false positive rates. The performance of FusorSV was similar to that of the best performing tools for each variant type at all depths.
The performance of both cnMOPS and Hydra depended on the size of the predicted duplications. The rate of false positive duplications increased with higher sequencing depths for cnMOPS and was biased towards smaller duplication sizes (Fig 2). All predicted duplications from Hydra were below 10 Kbp, which led to a Jaccard score of 0.01 at all depths (Supplemental_Table_S2.xls). At 60X depth, DELLY predicted a 2.32 Mbp false positive duplication, which resulted in a decreased Jaccard score (0.83).
Fig 2. Accuracy of predicted duplications from simulated short-read data. Results shown for 60X depth.
BreakDancer, DELLY, and Lumpy performed well for the prediction of inversions, as the accuracy of each tool was at least 0.92 at all depths. The accuracy of Hydra was considerably lower at 5X and decreased with increasing depth. The precision, recall, and F1 score for FusorSV were similar to those of the best performing tools for each variant type at all depths. However, lower Jaccard scores were observed for FusorSV (Supplemental_Table_S3.xls) due to several false positives spanning multiple megabases that were not predicted by other callers.
The performance of BreakDancer, Hydra, and DELLY depended on the size of the predicted inversions (Fig 3). Both BreakDancer and DELLY performed better for larger inversions, while the performance of Hydra decreased with increasing size. A 4.27 Mbp false positive in DELLY at 15X, 30X, and 60X resulted in decreased Jaccard scores.
Fig 3. Accuracy of predicted inversions from simulated short-read data. Results shown for 60X depth.
Prediction of Known Structural Variants in C. elegans
BC4586 is a C. elegans strain containing experimentally validated structural variants [32]. Publicly available Illumina sequencing data allowed us to determine whether the short-read SV callers in the SVE/FusorSV pipeline can resolve a 3910 bp deletion (DEL-1; IV:9,853,675–9,857,585), a 552 bp tandem duplication (DUP-1; IV:9,853,123–9,853,675), and a 4812 bp inversion (INV-1; IV:9,857,585–9,862,397). Only cnMOPS and CNVnator were able to resolve the deletion. Among the callers capable of predicting inversions, BreakDancer, DELLY, and Hydra predicted the inversion. Each caller predicted the tandem duplication. Both BreakDancer and DELLY predicted multiple overlapping duplications spanning DUP-1.
Structural Variation Prediction Using Simulated Long-Read Data
Simulated PacBio DNA sequencing data were used to evaluate the use of long-read sequencing to validate SV calls generated from short-read sequencing platforms. The mock genomes containing simulated deletions, duplications, and inversions were used to generate simulated PacBio reads at 5X, 15X, 30X, 60X, and 142X depth of coverage.
The performance of each caller varied considerably by variant type and depth (Table 3). For deletions, SVIM had a considerably higher accuracy at 5X depth (F1-score = 0.848) compared to pbsv (F1-score = 0.529) and Sniffles (F1-score = 0.052). SVIM had the highest accuracy at 15X (F1-score = 0.945), followed by Sniffles (F1-score = 0.918), and pbsv (F1-score = 0.566). Sniffles had the highest accuracy at 30X depth (F1-score = 0.993), followed by SVIM (F1-score = 0.959) and pbsv (F1-score = 0.599). SVIM had the highest accuracy at 60X (F1-score = 0.956), followed by Sniffles (F1-score = 0.929) and pbsv (F1-score = 0.781). At 142X, SVIM had the highest accuracy (F1-score = 0.956), followed by pbsv (F1-score = 0.793) and Sniffles (F1-score = 0.423).
Performance of long-read structural variant callers on simulated data.
The accuracy of predicted duplications at 5X was higher for SVIM (F1-score = 0.452), compared to pbsv (F1-score = 0.356) and Sniffles (F1-score = 0.033). At 15X, 30X, and 60X depth, Sniffles had the highest accuracy (15X F1-score = 0.909; 30X F1-score = 0.983; 60X F1-score = 0.935) compared to pbsv (15X F1-score = 0.378; 30X F1-score = 0.462; 60X F1-score = 0.554) and SVIM (15X F1-score = 0.571; 30X F1-score = 0.605; 60X F1-score = 0.636). SVIM had the highest accuracy at 142X (F1-score = 0.659), followed by pbsv (F1-score = 0.531), and Sniffles (F1-score = 0.321). Both pbsv and SVIM had lower recall than precision, indicating that missed variant calls decreased the accuracy of these callers. Lower recall also decreased the accuracy of Sniffles at 5X, 15X, and 30X depth, while lower precision contributed more at 60X and 142X depth.
The accuracy of predicted inversions was higher for pbsv at 5X (F1-score = 0.686), 60X (F1-score = 0.745), and 142X depth (F1-score = 0.760). Sniffles had the highest accuracy at 15X (F1-score = 0.978) and 30X depth (F1-score = 0.923). Lower precision in SVIM at 5X (0.457) accounted for its lower accuracy (F1-score = 0.493). At 15X, the SVIM precision and F1-scores increased to 0.836 and 0.742, respectively. The highest SVIM precision (0.918) and accuracy (F1-score = 0.763) were obtained at 30X. The SVIM precision decreased at 60X (0.833) and 142X depth (0.153), which resulted in lower accuracy (60X F1-score = 0.732; 142X F1-score = 0.248). At each depth, recall contributed more to the decreased accuracy of pbsv. For Sniffles, lower recall contributed more to lower accuracy at 5X and 15X depth, while lower precision contributed more to lower accuracy at the higher depths.
Agreement in Predicted Structural Variants for Wild C. elegans strains
To evaluate the usefulness of long-read DNA sequencing data for validating structural variants predicted from short-read technologies, we obtained both Illumina and PacBio sequencing data for 14 wild C. elegans strains.
The predicted variants varied considerably between the callers (Table 3). The predicted deletions ranged from a median of 76 per strain in SVIM to 341 in cnMOPS. Conversely, cnMOPS predicted a median of only 5.5 duplications per strain compared to 129 in Sniffles. The median number of predicted inversions per strain ranged from 8 in Lumpy to 88 in BreakDancer.
Predicted deletions, duplications, and inversions in C. elegans
Because the exact breakpoints for the same predicted variant may differ between callers, it is difficult to directly compare the agreement of calls generated by different tools. Therefore, predicted variants spanning protein-coding genes were used to compare caller agreement. For each comparison between callers, only predictions of a given SV type that spanned the exact same set of genes were considered to be in agreement with each other. Overlapping predictions between the long-read callers and FusorSV were compared to evaluate using long-read sequencing data for the validation of variants predicted from short-read data.
Among the total set of predicted deletions spanning genes, 190 were shared among all long-read callers. Of these deletions, 119 were predicted by FusorSV, and 64% of the genes spanned by deletions predicted by FusorSV were not shared by any long-read caller (Fig 4; Supplemental_Table_S4.xls). Within the set of genes overlapping deletions predicted by SVIM, 95% were shared by at least one other caller. The proportion of predictions unique to a single caller was higher for the other long-read tools (Assemblytics = 84%, MUM&Co = 32%, pbsv = 28%, Sniffles = 45%, SVIM = 5%). The percentage of unique genes spanned by deletions also varied among the short-read callers (Fig 4; Supplemental_Table_S5.xls) and ranged from 3% in DELLY to 80% in cnMOPS and CNVnator.
Fig 4. Caller agreement for deletions spanning protein-coding genes predicted from long-read tools and FusorSV. Each row represents the set of genes covered by deletions for a given caller. The columns depict the intersection of these predictions between callers. The plot is limited to the 20 largest sets.
Among the total set of predicted duplications spanning genes, only 66 were shared among all long-read callers. Of these, only one was predicted by FusorSV (Fig 5; Supplemental_Table_S6.xls). Of the genes spanned by duplications predicted by FusorSV, 77% were not shared by any long-read caller. Within the set of genes overlapping duplications, MUM&Co contained the fewest unique predictions (11%), followed by SVIM (14%), Assemblytics (20%), pbsv (42%), and Sniffles (44%). The percentage of unique genes spanned by duplications also varied considerably among the short-read callers (Supplemental_Table_S7.xls) and ranged from 5% in DELLY to 90% in CNVnator.
Fig 5. Caller agreement for duplications spanning protein-coding genes predicted from long-read tools and FusorSV. Each row represents the set of genes covered by duplications for a given caller. The columns depict the intersection of these predictions between callers. The plot is limited to the 20 largest sets.
Among the total set of predicted inversions spanning genes, 59 were shared among all long-read callers. Of these inversions, seven were predicted by FusorSV (Fig 6; Supplemental_Table_S8.xls). Within the set of genes overlapping inversions, SVIM contained the fewest unique predictions (41%), followed by pbsv (51%), MUM&Co (60%), and Sniffles (69%). Of the genes spanned by inversions predicted by FusorSV, 89% were not shared by any long-read caller. The percentage of unique genes spanned by inversions also varied considerably among the short-read callers (Supplemental_Table_S9.xls) and ranged from 17% in Lumpy to 47% in Hydra.
Fig 6. Caller agreement for inversions spanning protein-coding genes predicted from long-read tools and FusorSV. Each row represents the set of genes covered by inversions for a given caller. The columns depict the intersection of these predictions between callers. The plot is limited to the 20 largest sets.
DISCUSSION
Structural variant (SV) prediction is challenging for researchers studying non-human genomes. Most SV prediction tools were designed for the human genome, and benchmarks on other species are lacking [13,15,16,19]. Without an adequate guide for tool selection, the accuracy of predicted SVs carries a high degree of uncertainty. Here, we used simulated and real data to benchmark six short-read structural variant callers included in SVE [15], a pipeline developed to be used with FusorSV, an ensemble learning method that leverages the strengths of each individual caller, as well as five long-read callers.
The results for the simulated short-read data suggest that deletions and duplications may be predicted with high confidence using BreakDancer and DELLY, and accurate inversion predictions may be obtained using BreakDancer, DELLY, and Lumpy. The FusorSV performance typically reflected that of the best performing individual tools, but it occasionally predicted large, megabase-spanning false positives. WGS data from the 1000 Genomes Project (1000GP) [33] may be used to benchmark SV prediction using real data. These data provide a high-confidence human truth set, as SVs were validated using a combination of short- and long-read WGS data, Moleculo synthetic long-read sequencing, microarray SV detection, and targeted long-read sequencing. Our results differ considerably from the values reported in the literature for benchmarks generated using 1000GP data [15]. For example, the accuracy of deletion calls from 1000GP data was considerably lower for BreakDancer (F1-score = 0.47), DELLY (F1-score = 0.54), and FusorSV (F1-score = 0.62). Our results for duplications and inversions were substantially better than those reported for BreakDancer (duplication F1-score = 0.00, inversion F1-score = 0.08), DELLY (duplication F1-score = 0.01, inversion F1-score = 0.09), and FusorSV (duplication F1-score = 0.19, inversion F1-score = 0.45) based on 1000GP data. Differences in genome properties, such as repeat content, could account in part for the improvements seen here, but the use of simulated data was likely a major factor. Real data with experimentally validated variants would be valuable for identifying the underlying causes of these differences.
For the simulated PacBio data, the performance of each caller varied considerably by variant type and depth. For deletions, SVIM had the highest accuracy at 5X (F1-score = 0.848), 15X (F1-score = 0.945), 60X (F1-score = 0.956), and 142X depth (F1-score = 0.956), while Sniffles had the highest accuracy at 30X depth (F1-score = 0.993). For duplications, SVIM had the highest accuracy at 5X (F1-score = 0.452), and Sniffles had the highest accuracy at 15X (F1-score = 0.909), 30X (F1-score = 0.983), and 60X depth (F1-score = 0.935). Again, SVIM had the highest accuracy at 142X depth (F1-score = 0.659). For inversion calls, a higher accuracy was obtained using pbsv at 5X (F1-score = 0.686), 60X (F1-score = 0.745), and 142X depth (F1-score = 0.760). Sniffles had the highest accuracy at 15X (F1-score = 0.978) and 30X depth (F1-score = 0.923).
It should be noted that SVIM generates a VCF file containing all candidate SV calls, including calls of low confidence. Each prediction includes a quality score between 0 and 100 that provides a confidence estimate. The authors recommend using a threshold between 10 and 15 for higher depth datasets (e.g., >40X) or a threshold that generates the expected number of predictions. Therefore, for datasets with lower sequencing depth, the analyst may be limited to selecting an arbitrary cut-off when the expected number of SVs is unknown. The thresholds we used for 5X (minimum QUAL = 2), 15X (minimum QUAL = 5), and 30X depth (minimum QUAL = 10) were scaled down in proportion to the decrease in depth relative to the high-depth threshold specified by the SVIM authors. The performance for these cut-offs was similar to the optimal cut-off values calculated post hoc (Supplemental_Table_S10.xls, Supplemental_Table_S11.xls, Supplemental_Table_S12.xls), with the exception of inversions predicted at 142X depth, where a higher threshold is recommended.
Deletion benchmarks were previously described for pbsv and Sniffles using data from the Database of Genomic Variants [34] and NCBI dbVar [34] projects. The precision and recall were quantified for different numbers of reads supporting the deletions. Precision values up to 0.91 and 0.81 were reported for pbsv and Sniffles, respectively. Lower recall values were observed for pbsv (up to 0.45) and Sniffles (up to 0.26). By contrast, we observed lower precision but higher recall using simulated data.
Illumina and PacBio sequencing data from 14 natural C. elegans strains were analyzed to determine the concordance between predicted SVs among both short- and long-read callers. Low agreement was observed among all predictions generated using either Illumina or PacBio data. Furthermore, many SV calls unique to a single caller were observed for predictions made using either short or long reads. It is therefore difficult to ascertain the accuracy of structural variants described in the literature, as considerably different results may be generated by different tools. Nonetheless, the simulated data suggest that short-read tools such as DELLY, BreakDancer, and Lumpy are likely to provide more accurate SV calling across a range of depths compared to the other short-read tools included in these benchmarks. If training data are available, FusorSV may also be used, but large, megabase-spanning inversions should be interpreted with caution, as several large false positives were predicted by this tool. For SV prediction from long reads, the simulated data suggested that none of pbsv, Sniffles, or SVIM can be used with high confidence for the prediction of duplications or inversions using lower depth data (e.g., 5X). Fewer generalizations can be made for higher depths. SVIM had the best performance for the prediction of deletions at 15X depth, while the performance of Sniffles was superior at 30X. Sniffles performed the best for the prediction of duplications and inversions at 15X and 30X depth. At 60X depth, SVIM had the highest accuracy for deletions, but Sniffles and pbsv demonstrated superior performance for duplications and inversions, respectively. At 142X depth, the SVIM accuracy was considerably higher than that of pbsv and Sniffles for deletions and duplications, while the accuracy of pbsv was considerably higher for inversions. If precision is less of a concern than recall, Sniffles may be the preferable choice for predicting duplications from long-read data.
The concordance between long-read callers is pertinent if long-read data is to be used to validate candidate SVs called using short-read data. Although few predicted SVs were common to all long-read callers, a large majority of the predicted deletions and duplications from SVIM were supported by at least one other caller. Therefore, SVIM might provide a more conservative option for validating deletions and duplications predicted from short-reads. Although inversions predicted by SVIM had the highest support among other callers, over one quarter of these calls were unique to SVIM. Therefore, long-read data may be less reliable for validating inversions predicted from short-reads.
To assess the agreement between short- and long-read approaches, the overlap between FusorSV predictions and those of the long-read tools was measured. For each variant type, over three quarters of the FusorSV predictions were not shared by any of the long-read tools. Because higher accuracy has been reported for long-read tools in the literature, it is likely that FusorSV generated many false positives. Low agreement was also observed for many of the SVs predicted by the individual tools used to train FusorSV, despite the higher accuracy observed in the simulated data. This may reflect a limitation of using simulated data to train FusorSV, as simulated data may bias the FusorSV models towards callers that perform poorly on real data.
The low agreement between different SV callers emphasizes the challenges involved in selecting ideal tools for SV calling. Without high quality benchmarks in non-human species, tool choice is largely arbitrary, and the analyst will likely select a caller based on popularity. Despite the lack of benchmarks, researchers increasingly rely upon these tools for the characterization of SVs in non-human species, calling into question the reliability and reproducibility of past research findings. Improved benchmarks will provide an important resource for the research community.
CONCLUSIONS
It is challenging to choose the appropriate tool for the prediction of structural variants from DNA sequencing data. Dozens of callers have been developed for calling structural variants using short-read data, but few independent benchmarks are available. Compounding this problem is the lack of benchmarks for non-human genomes. Here, multiple short-read and long-read callers were compared using both real and simulated C. elegans data. The results using simulated data showed that the performance of a given tool often varies considerably according to variant type and sequencing depth and that no single tool performed best for all situations. The predictions generated from real data showed low overlap among all callers and many predictions unique to individual tools.
Because variants predicted from short-reads depend highly on the tool used, the analyst may choose to validate these SVs using long-read data. However, the lack of a consensus among long-read callers suggests that using a long-read caller to generate a “truth-set” warrants caution. Nonetheless, most of the deletions and duplications predicted by SVIM and MUM&Co, respectively, were shared by at least one other caller. These tools may provide a more conservative approach for validating SV calls using long-reads. The availability of reference datasets for which the “ground truth” is known would provide valuable resources for improving our understanding of the best approaches for SV prediction in non-human organisms. Future benchmarking projects would benefit from publicly available data from strains with precise deletions of various lengths generated using CRISPR-Cas9 methods, as well as further lab strains with SVs validated manually using long-read technologies and PCR.
METHODS
STRUCTURAL VARIATION PREDICTION
Structural Variation Engine (SVE) and FusorSV (v0.1.3-beta) [15] were used to predict structural variants (deletions, duplications, and inversions) from real and simulated short-read sequencing data. SVE is an SV calling pipeline that produces VCF files compatible with FusorSV. FusorSV uses an ensemble learning approach to call structural variants using a fusion model trained on the individual callers. The six structural variant callers included in SVE that support non-human genomes were evaluated here: BreakDancer [35], cnMOPS [36], CNVnator [37], DELLY [38], Hydra [39], and Lumpy [22]. The default SVE and FusorSV parameter settings were used.
Five tools were used to predict structural variants (deletions, duplications, and inversions) from real and simulated long-read sequencing data: Assemblytics [40] (v1.2.1), MUM&Co [41] (v2.4.2), pbsv (https://github.com/PacificBiosciences/pbsv) (v2.6.2), Sniffles [42] (v1.0.12a), and SVIM [21] (v2.0.0). For each tool, the recommended long-read aligner and default parameter settings were used. The genome assemblies and alignments required for Assemblytics and MUM&Co were created with Canu [43] (v2.2) and MUMmer [44] (4.0.0rc1), respectively. The alignments used with pbsv were created using pbmm2 (v.1.7.0). The alignments used by Sniffles and SVIM were created using ngmlr [42] (v.0.2.7). Low confidence predictions below the SVIM quality score threshold were discarded using a different cut-off for each sequencing depth (5X = 2; 15X = 5; 30X = 10; 60X = 15; 142X = 15).
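As an illustration of the quality filtering step, the short Python sketch below removes records from an uncompressed SVIM VCF that fall below a depth-specific QUAL cut-off, using the thresholds listed above. The file names and the helper function are illustrative assumptions; they are not part of SVIM or of the pipeline scripts.

# Minimal sketch (not part of SVIM): drop records below a depth-specific
# minimum QUAL score. Thresholds follow the cut-offs described above
# (5X = 2; 15X = 5; 30X = 10; 60X = 15; 142X = 15).
MIN_QUAL_BY_DEPTH = {5: 2, 15: 5, 30: 10, 60: 15, 142: 15}

def filter_svim_vcf(in_path, out_path, depth):
    """Keep header lines and records whose QUAL meets the depth-specific cut-off."""
    min_qual = MIN_QUAL_BY_DEPTH[depth]
    with open(in_path) as vcf_in, open(out_path, "w") as vcf_out:
        for line in vcf_in:
            if line.startswith("#"):        # header lines are kept unchanged
                vcf_out.write(line)
                continue
            qual = line.split("\t")[5]      # QUAL is the sixth VCF column
            if qual != "." and float(qual) >= min_qual:
                vcf_out.write(line)

# Example with hypothetical file names:
# filter_svim_vcf("svim_variants.vcf", "svim_variants.filtered.vcf", depth=30)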
Custom Python (v3.7) and Bash scripts were used to select the final set of SV predictions to benchmark based on the following criteria: a minimum size of 100 bp and a VCF FILTER flag of "PASS".
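The selection criteria translate into a simple per-record check. The sketch below assumes that each record carries an SVLEN value in its INFO field (typical for most SV callers, but not guaranteed); it is an illustrative reimplementation, not the actual scripts.

# Illustrative per-record filter: FILTER must be PASS and the SV must be
# at least 100 bp long (deletions commonly report a negative SVLEN).
def passes_selection_criteria(vcf_record):
    """Return True for records with FILTER == "PASS" and an SV size >= 100 bp."""
    fields = vcf_record.rstrip("\n").split("\t")
    filter_flag = fields[6]                       # seventh VCF column
    info = dict(item.split("=", 1)
                for item in fields[7].split(";") if "=" in item)
    sv_length = abs(int(info.get("SVLEN", "0")))  # assumes SVLEN is present
    return filter_flag == "PASS" and sv_length >= 100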
SIMULATED DATA
Mock C. elegans genomes containing simulated structural variants (deletions, duplications, and inversions) were created with svsim (v0.1.1) based on the WormBase [45] (WBcel235; https://parasite.wormbase.org/Caenorhabditis_elegans_prjna13758/Info/Index/) reference assembly for the N2 strain. Because FusorSV uses the SV type and size as discriminating features to train the fusion model, training datasets were created with variable numbers of simulated deletions, duplications, and inversions ranging in size from 200 bp to 280 Kbp. The small training dataset comprised 43 mock genomes and included a total of 10 structural variants of each type per size bin used by FusorSV. The medium training dataset included 86 mock genomes with a total of 20 variants of each type per size bin, and the large training dataset included 129 mock genomes with 30 variants of each type per size bin. A testing dataset was created that included 129 mock genomes with 30 variants of each type per size bin.
Short-read DNA sequencing of the mock genomes was simulated with the randomreads.sh script included with BBTools (v38.79; sourceforge.net/projects/bbmap/). Paired-end reads of 100 bp were simulated at 5X, 15X, 30X, and 60X depths of coverage using the Illumina error model with default settings. SimLoRD [46] (v1.0.4) was used to simulate PacBio sequencing data for the mock genomes. The SimLoRD PacBio sequencing runs were simulated at depths of 5X, 15X, 30X, 60X, and 142X (the median depth of the real PacBio data used to predict SVs with the long-read tools). The SVIM quality score cut-off was 1 for 5X, 5 for 15X and 30X, 10 for 60X, and 15 for 142X.
REAL DATA
Data from the Caenorhabditis elegans Natural Diversity Resource (CeNDR) [23] were used to predict SVs in 14 C. elegans isolates collected from the wild. SV prediction using the SVE/FusorSV pipeline was performed using the BAM files provided for the 20200815 CeNDR release. SimuSCoP [47] (v1.0) was used to generate the simulated DNA sequencing data used to train the FusorSV model. DNA sequencing data from the C. elegans N2 reference strain were used to create the sequencing profiles used by SimuSCoP (SRA runs SRR3452263, SRR1013928, and SRR9719854). For each strain, FusorSV models trained on simulated data of similar sequencing depth and read length were used to predict variants.
BC4586, a C. elegans strain containing validated structural variants, was used to evaluate the ability of the SVE/FusorSV pipeline to predict experimentally validated structural variants. Simulated SimuSCoP DNA sequencing data were generated using an N2 profile (SRA run SRR14489487) and used to train the FusorSV model. SVs for BC4586 (SRA run SRR14489485) were predicted using the SVE/FusorSV pipeline and used to identify the presence of a deletion on chromosome IV (coordinates = 9,853,675–9,857,585), a tandem duplication on chromosome IV (coordinates = 9,853,123–9,853,675), and an inversion on chromosome IV (coordinates = 9,857,585–9,862,397).
STRUCTURAL VARIATION BENCHMARKING
Variant calling resulted in multiple overlapping structural variants of the same type, which can lead to inflated performance metrics, as each prediction may be counted as a true positive when compared to the truth dataset. For overlapping SV predictions of the same type and caller, a single call was selected using the criteria described in Supplemental_Table_S13.xls. When no discriminating information was available among overlapping calls, the final SV was selected randomly.
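The de-duplication step can be sketched as a grouping of overlapping calls of the same type followed by the selection of a single representative. In the sketch below the representative is chosen at random, which stands in for the caller-specific criteria listed in Supplemental_Table_S13.xls; the tuple layout and function name are assumptions made for the example.

# Illustrative sketch: collapse overlapping calls of the same SV type
# from a single caller into one representative call each.
import random

def deduplicate_calls(calls):
    """`calls` is a list of (chrom, start, end, sv_type) tuples from one caller.

    Calls are grouped by simple interval overlap within the same type and
    chromosome; one call per group is kept, chosen at random when no other
    criterion applies.
    """
    selected = []
    group = []
    group_max_end = None
    for call in sorted(calls, key=lambda c: (c[3], c[0], c[1])):
        if (group and call[3] == group[-1][3] and call[0] == group[-1][0]
                and call[1] < group_max_end):
            group.append(call)                     # overlaps the current group
            group_max_end = max(group_max_end, call[2])
        else:
            if group:
                selected.append(random.choice(group))
            group = [call]
            group_max_end = call[2]
    if group:
        selected.append(random.choice(group))
    return selected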
BENCHMARKING USING SIMULATED DATA
For simulated data, each predicted variant was classified as being either a true positive (TP) or false positive (FP) using the Bedtools intersect command. Predictions that overlapped (minimum reciprocal overlap = 0.5) with at least one simulated variant in the mock genome were classified as true positives (TP), and those calls that did not were classified as false positives (FP). Simulated variants that were not predicted were classified as false negatives (FN). These classifications were used to calculate the following performance metrics:
Precision is the ratio of true positives (TP) to the total number of predicted variants, and was calculated as follows:

Precision = TP / (TP + FP)
Recall is the ratio of true positives to the total number of simulated variants, and was calculated as follows:

Recall = TP / (TP + FN)
The F1 score provides a measure of the prediction accuracy by taking the harmonic mean of the precision and recall. The F1 score was calculated as follows:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
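To make the classification and scoring concrete, the following minimal Python sketch applies a 50% reciprocal-overlap rule (analogous to the Bedtools intersect step described above) and computes the three metrics from the resulting counts. The interval representation and function names are illustrative assumptions rather than the scripts used in this study.

# Illustrative sketch of the benchmarking classification and metrics.
# Intervals are (start, end) tuples on a single chromosome.
def reciprocal_overlap(a, b, min_fraction=0.5):
    """True when intervals a and b overlap by >= min_fraction of both."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    return (overlap / (a[1] - a[0]) >= min_fraction and
            overlap / (b[1] - b[0]) >= min_fraction)

def classify(predicted, simulated):
    """Split calls into true positive, false positive, and false negative counts."""
    tp = [p for p in predicted
          if any(reciprocal_overlap(p, s) for s in simulated)]
    fp = [p for p in predicted if p not in tp]
    fn = [s for s in simulated
          if not any(reciprocal_overlap(p, s) for p in predicted)]
    return len(tp), len(fp), len(fn)

def performance_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from the classification counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 90 TPs, 10 FPs, and 30 FNs give precision 0.90, recall 0.75,
# and an F1 score of approximately 0.82.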
Because the precision, recall, and F1 scores were calculated using binary classifications, they provide an incomplete picture. For each SV type, predictions meeting the minimum reciprocal overlap threshold are classified as true positives but may include calls with imprecise breakpoints or size estimates. Furthermore, no single agreed upon threshold exists for reciprocal overlap and the cutoff value is typically chosen arbitrarily. SVs may be called as multiple separate events that span the true variant [15], none of which meet the minimum reciprocal overlap requirement. If identifying genomic regions containing structural variation is of greater importance to the analyst compared to the precision of individual calls, metrics that are calculated using presence or absence classifications may be unsuitable.
For example, a call classified as a true positive is not penalized for predicting breakpoints that extend beyond the boundaries of the true structural variant. Similarly, a true positive is not penalized for predicting breakpoints that fall within the true breakpoints of the structural variant.
Therefore, the Jaccard index was calculated to measure the amount of overlap between the predicted variants and simulated variants.
The Jaccard index was calculated using the ratio of the number of base pairs in the intersection and union of the predicted and simulated variants:
Jaccard = |predicted variants ∩ simulated variants| / |predicted variants ∪ simulated variants|
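A base-pair level Jaccard index can be computed by treating each call set as the set of genomic positions it covers. The sketch below is a transparent but memory-hungry illustration (interval arithmetic, or a tool such as bedtools jaccard, would be used in practice); it is not the script used in this study.

# Illustrative base-pair Jaccard between two interval sets on one chromosome.
def jaccard_index(predicted, simulated):
    """Intervals are half-open (start, end) tuples; positions are expanded
    into explicit sets, which is simple to follow but only practical for
    modest region sizes."""
    predicted_bases = set()
    for start, end in predicted:
        predicted_bases.update(range(start, end))
    simulated_bases = set()
    for start, end in simulated:
        simulated_bases.update(range(start, end))
    union = predicted_bases | simulated_bases
    return len(predicted_bases & simulated_bases) / len(union) if union else 0.0

# Example: a 1,000 bp call covering half of a 2,000 bp simulated deletion
# yields 1,000 / 2,000 = 0.5.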
BENCHMARKING USING REAL DATA
Due to the lack of real data containing known structural variants, the sets of genes spanned by SVs predicted by each caller were compared to evaluate the consistency between different tools. C. elegans genome annotations obtained from WormBase [45] (release WBPS14) were used to identify which genes were spanned by SVs in the CeNDR data. The gene set was limited to protein-coding genes where at least 50% of the gene was covered by an SV. SVs larger than 50 Kbp were excluded because this is the maximum size considered reliable by Assemblytics. For each comparison between callers, only predictions of a given SV type that spanned the exact same set of genes were considered to be in agreement with each other. The agreement between callers was depicted using UpSet plots [48].
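As an illustration of the agreement criterion, the sketch below derives the set of protein-coding genes covered at least 50% by each SV and treats calls from two callers as concordant only when those gene sets are identical. The tuple-based data structures and function names are assumptions made for the example, not the actual scripts.

# Illustrative sketch of the gene-set agreement comparison.
def genes_spanned(sv, genes, min_fraction=0.5):
    """Frozenset of gene IDs whose length is covered >= min_fraction by an SV.

    `sv` is a (chrom, start, end) tuple; `genes` is an iterable of
    (gene_id, chrom, start, end) records for protein-coding genes.
    """
    chrom, sv_start, sv_end = sv
    covered = set()
    for gene_id, gene_chrom, gene_start, gene_end in genes:
        if gene_chrom != chrom:
            continue
        overlap = min(sv_end, gene_end) - max(sv_start, gene_start)
        if overlap > 0 and overlap / (gene_end - gene_start) >= min_fraction:
            covered.add(gene_id)
    return frozenset(covered)

def concordant_gene_sets(calls_a, calls_b, genes):
    """Gene sets spanned by SVs of one type that are shared by two callers."""
    sets_a = {genes_spanned(sv, genes) for sv in calls_a} - {frozenset()}
    sets_b = {genes_spanned(sv, genes) for sv in calls_b} - {frozenset()}
    return sets_a & sets_b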
DATA ACCESS
Custom Python (v3.7) and Bash scripts are available at https://github.com/kyleLesack/sv_calling_benchmarking.
Supporting information captions
Supplemental_Table_S1: Prediction of deletions from simulated short-read data
Supplemental_Table_S2: Prediction of duplications from simulated short-read data
Supplemental_Table_S3: Prediction of inversions from simulated short-read data
Supplemental_Table_S4: Degree of overlap for deletions spanning genes predicted from long-read data and FusorSV
Supplemental_Table_S5: Degree of overlap for deletions spanning genes predicted from short-read data
Supplemental_Table_S6: Degree of overlap for duplications spanning genes predicted from long-read data and FusorSV
Supplemental_Table_S7: Degree of overlap for duplications spanning genes predicted from short-read data
Supplemental_Table_S8: Degree of overlap for inversions spanning genes predicted from long-read data and FusorSV
Supplemental_Table_S9: Degree of overlap for inversions spanning genes predicted from short-read data
Supplemental_Table_S10: Deletion performance for different SVIM QUAL cut-off thresholds
Supplemental_Table_S11: Duplication performance for different SVIM QUAL cut-off thresholds
Supplemental_Table_S12: Inversion performance for different SVIM QUAL cut-off thresholds
Supplemental_Table_S13: Criteria used to select best overlapping call
ACKNOWLEDGEMENTS
This work was supported by Results Driven Agricultural Research (RDAR #2016F013R) to JDW, the Natural Sciences and Engineering Research Council of Canada (NSERC) through Discovery Grants (#06239-2015 and 04589-2020) to JDW, and an NSERC Collaborative Research and Training Experience Program (CREATE) program in Host-Parasite Interactions (#413888-2012) to JDW and others. ECA was supported by a National Science Foundation CAREER award.
Footnotes
The paper has been heavily updated following peer-review and will be submitted to a different journal.