ABSTRACT
Background The utility of long-read genome sequencing platforms has been shown in many fields including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability in producing a genome representation.
Results For this study, we leveraged publicly-available data from sample NA12878 generated on Oxford Nanopore and Pacific Biosciences platforms. Each tool that was benchmarked, including GraphMap, minimap2, and NGMLR, produced the same alignment file each time. However, the different tools widely disagreed on which reads to leave unaligned, affecting the end genome coverage and the number and locations of discoverable breakpoints. Only minimap2 was computationally lightweight enough for use at scale. No alignment from any single tool independently resolved all large structural variants (10,000-100,000 basepairs) present in the Database of Genomic Variants (DGV) for sample NA12878. For variants larger than 1,000,000 basepairs, nanopore sequence aligned with minimap2 and NGMLR, and single-molecule real-time sequence aligned with NGMLR contained more breakpoints than are present in DGV.
Conclusions When computational resources are not a limiting factor, it should be best practice to use an analysis pipeline that generates alignments with both minimap2 and NGMLR, as neither results in a comprehensive genome representation. When computational resources are limited, use of minimap2 for human genome alignment produces files sufficient to answer hypotheses and generate new questions.
BACKGROUND
The diverse ecosystem of DNA preparation, sequencing, and mapping technologies is capable of generating computational representations of genome biology through different chemistries and processes (1–5). Sequencing-by-synthesis platforms produce highly accurate sequence data in the form of short reads (<300 basepairs, bp) (6) from high molecular weight DNA inputs. Libraries prepared for short-read sequencing (SRS) can also be oriented in the context of long-range genome position through high-throughput chromosome conformation capture (Hi-C) methods (7). In contrast, single-molecule real-time platforms can produce increasingly accurate reads on molecules >10 kilobasepairs (kbp) (8) while nanopore-based sequencing (9) and nanochannel-based mapping platforms (10) can sequence or visualize megabase-length DNA molecules (11). Each of these platforms has different abilities and available tools contributing to the diversity of projects enabled by these technologies (12,13).
Optical genome mapping (developed by Bionano Genomics) excels at detecting complex structural variants (SV), such as balanced translocations and deletions/insertions in the 1 kb to 1 Mb range, and its clinical utility was demonstrated in Duchenne (DMD) and facioscapulohumeral (FSHD) muscular dystrophies (14,15). In contrast, long-read sequencing (LRS), developed by Pacific Biosciences and Oxford Nanopore Technologies, is the most appropriate technology to detect smaller variants in the 50 bp-1 kb range (16), especially in repetitive regions, and should therefore bring new diagnostic potential for diseases of trinucleotide repeat expansion and genome instability such as Huntington’s Disease and Myotonic Dystrophies (17–21), as well as for discovery of SVs affecting regulatory regions of known genes. Fulfilling the promise of LRS for medical genetics requires understanding the tools that power the technology and their respective strengths.
LRS platforms have been shown to be capable of generating single, phaseable reads spanning repetitive or complex genomic regions that remain unresolved in current human genome assemblies (8,9,22,23). This has implication for resolution of highly homologous pseudogenes and large SV, even in low-complexity genomic regions. Read lengths in the tens of kbp have further allowed for end-to-end viral genome sequencing (24), while read lengths in the millions of bp have the potential to span whole mammalian chromosomes (9,25). This potential for more contiguous de novo human assembly has led to studies to specifically improve and benchmark basecalling and polishing tools (26), as well as assembly tools (27–30) that can be applied to human genomics.
For reference genome-guided experiments, LRS has proven useful for amplicon sequencing in cancer detection (31) and for metagenomics (32–34), but the field is still in early days for assessment of variation in high-coverage human whole genome sequence. Indeed, high-coverage human whole genome sequencing on LRS platforms has yet to be undertaken on the scale of projects based on Illumina-developed SRS platforms (35,36). This difference in data availability is reflected in the number of tools for SV calling: at least 21 for SRS and five for LRS (13). It is certain that, due to the richness of the sequence data and diminishing cost of generating those data, these LRS platforms will be used in large-scale reference-guided whole human genome projects that are presently dominated by SRS data production (37).
The value of LRS in resolving SV has been demonstrated by efforts aggregating callsets across platforms and technologies to deeply characterize a genome (16). These experiments are concerned mainly with understanding the full scope of the architecture of a genome, rather than contrasting differences between results of one alignment tool or another. Each alignment tool showcases its utility and strengths in its initial publication, especially in terms of the mapping quality of reads to the reference genome and the precision and sensitivity with which known variants are preserved in synthetic or downsampled genomic data. However, there have been no studies that specifically benchmark LRS platforms and tools for reference-guided experiments.
To address this gap, we benchmarked LRS alignment tools with datasets generated from the Joint Initiative for Metrology in Biology’s Genome in a Bottle Initiative (GIAB) sample, NA12878, which was sequenced on Oxford Nanopore’s MinION platform by the Whole Genome Sequencing Consortium (9) and on the Pacific Biosciences SMRT platform by the National Institute of Standards and Technology (38). We compared computational performance (peak memory utilization, central processing unit (CPU) time, file size/storage requirements) and genome coverage depth (x-times), and quantified the reads left unaligned in any given experiment. Since the resolution of large SVs is a key application of this technology, and also allows us to compare the differences in genome alignments in an aggregate way, we ran the SV-calling tool Sniffles to highlight differences in breakpoint location in each binary alignment map (BAM) file (39). Taken together, these experiments present a comprehensive view of differences in the products of LRS whole genome alignment pipelines.
RESULTS
All experiments were performed on the NA12878 sample, for which publicly available datasets exist for both nanopore and SMRT sequencing, and rigorously annotated variants are available in the Database of Genomic Variants (DGV) (40). We present our results in four sections: 1) the tools that passed our exclusion criteria, 2) computational performance and benchmarking, 3) an analysis of aligned and unaligned reads, 4) an analysis of structural variation present in each alignment compared to a baseline.
1. Tools that passed the exclusion criteria
Following a literature review of available alignment tools, we established a set of inclusion criteria (see Materials and Methods), which accounted for both types of LRS data (a tool must be able to handle both nanopore and SMRT reads), as well as the state of the field in terms of software updates (must not have been superseded by another tool) and adoption of the tool, as measured by citations (must be cited by manuscripts written by outside parties). All surveyed tools and criteria are outlined in Table 1.
Three alignment tools, GraphMap, minimap2, and NGMLR, met inclusion criteria and were included in this study (39,41,42). Kart was excluded because, while it was benchmarked in 2018 against minimap2 (42), it had not yet been cited by any work indexed in the Web of Knowledge (43). Mashmap2 was also excluded, as it only had 10 citations. LAST and BLASR were excluded because they did not have modes to account for both types of data (44,45). BWA-MEM was excluded because it was superseded by minimap2 (46).
2. Computational performance and benchmarking
Computations were run three times on full node capacity on our university’s high-performance compute cluster (HPC), including 40 CPUs with a configuration described in the methods. Further one-off runs were performed with restricted node capacity of 30, 20, and 10 CPUs. All tools were run on default settings to reflect typical use in exploratory studies. Summaries of findings can be examined in Table 2. Full reports on each run can be examined in Supplemental Table S1, while a composite table of the output of samtools bamstats is available as Supplemental Table S2.
Minimap2 was the least resource-demanding tool
Minimap2 successfully aligned nanopore data every time with unrestricted and restricted resources. Unrestricted runs used an average of 14.2 gigabytes (Gb) and the jobs took ~540 CPU hours (~13 wall clock hours) to complete.
Minimap2 also successfully aligned SMRT data every time with unrestricted or restricted resources. The runs used an average of 16.9 Gb and the jobs took ~641 CPU hours (~16 wall clock hours) to complete, a little longer than for the nanopore dataset.
Memory usage (see peak_rss columns in Supplemental Table S1) and runtime were consistent across triplicate runs with unrestricted resources and did not change with restriction of resources when the tool was used with either dataset. The consistency of results, as well as the speed and relatively low computational demands of minimap2 make it a strong candidate for inclusion in clinical analysis pipelines.
GraphMap was the most resource-intensive tool, resulting in 3/12 run failures
GraphMap successfully aligned nanopore data two out of three times in both restricted and unrestricted resources runs. For the two successful unrestricted runs, an average of ~96 Gb memory was used and the jobs took an average of ~8,328 CPU hours (~208 wall clock hours) to complete. This was 6.6 times more memory and 13 times longer than Minimap2. Memory usage was unchanged with resource restriction, but runtime did increase by approximately 150,000 CPU seconds (~7.5 hours).
GraphMap successfully aligned SMRT data every time with unrestricted resources and two out of three times with restricted resources. These runs used an average of ~50 Gb memory and the jobs took an average of ~10,642 CPU hours (~266 wall clock hours, 11 days) to complete. All unrestricted resource jobs used ~50 Gb of memory. One restricted run (30 threads) used ~49 Gb of memory while one (20) used ~46 Gb. The restricted run limited to 10 threads failed due to time out at 14 wall clock days.
Runtime across all runs did not directly relate to memory usage or resource restriction (Table 2, Supplemental Table S1). The resource-intensive requirements of GraphMap for whole human genome alignment, combined with failures to run in both datasets, likely preclude including it in the design of most analysis pipelines. However, alignment files from successful runs were included in further analyses.
NGMLR completed all tasks, but performance was very resource-dependent
NGMLR successfully aligned nanopore data every time with unrestricted and restricted resources. The unrestricted runs used an average of 33.8 Gb and the jobs took ~2,539 CPU hours (~63.5 wall clock hours) to complete, intermediate between Minimap2 and GraphMap.
NGMLR was much faster on SMRT data. It successfully aligned every time even with restricted resources. The runs used an average of 29.7 Gb and the jobs took ~658 CPU hours (~16.5 wall clock hours, almost 4 times faster than on the nanopore data) to complete, on par with the time taken by Minimap2.
Memory usage and runtime were consistent with the average across triplicate runs, but, unlike the other tools, NGMLR’s performance was resource-dependent: runtime increased and memory usage decreased with resource restriction. While NGMLR is more computationally intensive than minimap2, there are no glaring concerns about its suitability for whole human genome alignment-based pipelines, as there are with GraphMap.
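The CPU-hour and wall-clock figures reported in this section can be related with simple arithmetic. The sketch below assumes (this is our assumption, not a statement from the methods) that wall-clock time is approximately CPU time divided by the 40 CPUs of a fully used node; the resulting values appear consistent with the wall-clock hours quoted above.

```python
# Sketch: relate reported CPU hours to wall-clock hours on a 40-CPU node.
# Assumption (not stated in the text): wall-clock ~ CPU hours / CPU count,
# i.e. all CPUs are kept busy for the whole job.
CPUS_PER_NODE = 40  # full node capacity described in the text

# Average CPU hours of the unrestricted runs, as reported in this section.
cpu_hours = {
    ("minimap2", "nanopore"): 540,
    ("minimap2", "SMRT"): 641,
    ("GraphMap", "nanopore"): 8328,
    ("GraphMap", "SMRT"): 10642,
    ("NGMLR", "nanopore"): 2539,
    ("NGMLR", "SMRT"): 658,
}


def wall_clock_hours(cpu_h, n_cpus=CPUS_PER_NODE):
    """Approximate wall-clock hours for a job that saturates n_cpus CPUs."""
    return cpu_h / n_cpus


for (tool, platform), hours in cpu_hours.items():
    estimate = wall_clock_hours(hours)
    print(f"{tool:9s} {platform:8s} {hours:>6} CPU h ~ {estimate:6.1f} wall-clock h")
```

Restricted runs (30, 20, or 10 CPUs) would need the corresponding CPU count passed as `n_cpus`.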
3. Whole genome alignment
Aligned reads
In every experiment that completed successfully, alignments of the same genome with the same tool had the same genome coverage, whether resource-restricted or not (Supplemental Table S1, column O). For this reason, one file generated from the first unrestricted run was used for subsequent analysis (shown in Table 2). Genome coverages of ~30x and ~44x were reported for nanopore (9) and SMRT (38,47) analyses respectively. We calculated the actual coverage after read alignment to the hg38 reference by each tool.
GraphMap produced higher coverage genomes than expected on both nanopore (32.8x) and SMRT (45.4x) datasets.
Minimap2 produced higher coverage nanopore genomes (33.4x), and expected (within 1x) coverage SMRT genomes (43.6x).
NGMLR produced higher coverage nanopore genomes (31.1x) and lower coverage SMRT genomes (39.7x).
For nanopore data, coverage was higher than expected, by 3.7%, 9.3% and 11.3% for NGMLR, GraphMap, and Minimap2 respectively. For SMRT data, GraphMap also produced slightly higher than expected coverage (by 3.2%), while NGMLR came in at 9.8% lower than expected. Minimap2 showed a small difference (−0.9%). These results are not sufficient evidence to pass judgment on the utility of any tool, but prompted us to examine where those discrepancies come from. We analyzed the unmapped reads to find the overlap in consensus-excluded reads.
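The percentages above follow from the reported and observed coverage values. As a minimal sketch, the relative deviation can be computed as (observed − expected) / expected, using the ~30x and ~44x figures as the expected values; the function and variable names are ours.

```python
# Sketch: percent deviation of observed coverage from the reported coverage.
# Expected and observed values are taken from the text; names are illustrative.
EXPECTED = {"nanopore": 30.0, "SMRT": 44.0}

OBSERVED = {
    ("GraphMap", "nanopore"): 32.8,
    ("GraphMap", "SMRT"): 45.4,
    ("minimap2", "nanopore"): 33.4,
    ("minimap2", "SMRT"): 43.6,
    ("NGMLR", "nanopore"): 31.1,
    ("NGMLR", "SMRT"): 39.7,
}


def percent_deviation(observed, expected):
    """Signed deviation of observed from expected coverage, in percent."""
    return 100.0 * (observed - expected) / expected


for (tool, platform), cov in OBSERVED.items():
    dev = percent_deviation(cov, EXPECTED[platform])
    print(f"{tool:9s} {platform:8s} {cov:5.1f}x  {dev:+.1f}%")
```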
Unaligned reads
The readname assigned to each read in a fastq retained in the BAM allowed us to write an R script to directly compare the lists of reads that were not included in the alignment by each alignment tool (Fig. 1).
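The comparison itself was done in R; the following Python sketch only illustrates the underlying idea under our own assumptions. Each read name is assigned to the exact set of tools that left it unaligned, which gives the counts for every region of a three-way Venn diagram. The read names and the function `overlap_counts` are hypothetical.

```python
# Sketch: count unaligned reads by the exact combination of tools that
# excluded them. This is a Python analogue of the idea, not the R script
# used in the study; all read names below are toy values.
from collections import Counter


def overlap_counts(unaligned_by_tool):
    """Map each combination of tools to the number of reads that exactly
    that combination left unaligned."""
    membership = {}
    for tool, reads in unaligned_by_tool.items():
        for read in reads:
            membership.setdefault(read, set()).add(tool)
    return Counter(frozenset(tools) for tools in membership.values())


toy = {
    "GraphMap": {"r1", "r2", "r5"},
    "minimap2": {"r1", "r3", "r5"},
    "NGMLR": {"r1", "r2", "r3", "r4"},
}

counts = overlap_counts(toy)
for tools, n in sorted(counts.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(" & ".join(sorted(tools)), n)
```

In practice the input sets would be filled from the read names of the unmapped records extracted from each BAM file.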
All tools agreed to leave ~1.6 million nanopore reads unaligned
For the ~30x nanopore dataset, all three tools agreed to exclude the same ~1.6 million reads. NGMLR left the highest number of reads unaligned, ~3.3 million, which explains why it produced the lowest coverage genome. It agreed with the other tools on approximately half of its discarded reads. Minimap2 and NGMLR agreed to leave a further ~0.5 million reads unaligned, while NGMLR and GraphMap agreed on a separate ~0.48 million unaligned reads and GraphMap and Minimap2 agreed on ~0.4 million further unaligned reads.
All tools agreed to leave ~1.2 million SMRT reads unaligned
Similarly, for the ~44x SMRT dataset, all three tools agreed to exclude the same ~1.2 million reads. NGMLR left the highest number of reads unaligned, ~6 million, which explains why it produced the lowest coverage genome. Minimap2 and NGMLR agreed to leave a further ~1.6 million reads unaligned, while NGMLR and GraphMap agreed on a separate ~0.78 million unaligned reads and GraphMap and Minimap2 agreed on ~0.01 million further unaligned reads.
Perhaps the most interesting feature of the SMRT experiments is that NGMLR independently excluded ~6.5 times more reads than GraphMap or Minimap2. This is also ~1.5 times more reads than NGMLR independently excluded from the nanopore data.
4. Structural variation differences across alignments
By examining the coverage of the genome and the overlap of the discarded reads, it became clear that more detail was needed to understand the differences in alignment. To globally compare the alignments to each other, and assess their usability for variant calling, we looked to the SV known to be present in the NA12878 genome as curated by the DGV resource.
To compare the SV breakpoints preserved in the genome alignments created in this study, we leveraged the sniffles SV caller because of its highly-detailed output files. We used the curated SV present in DGV, annotated for presence in NA12878 as reference. Due to the different nomenclature for SV type in the sniffles output VCF specification document (48) and the DGV annotations, comparisons were limited to the four classes of variant that were most unambiguously defined in both: insertions, deletions, duplications, and inversions.
Sniffles variants were graphed by SV length on the X-axis in shades of blue, contrasted with DGV variants in red, organized by platform and SV type (Fig. 2). Deletions and insertions were overestimated in all data sets compared to the number curated in DGV. Conversely, the alignment files preserved fewer breakpoints recognized by sniffles as duplications and inversions than are present in DGV. Globally, there are more indels <10,000 bp present in VCFs than are present in DGV. This trend largely reverses above 10,000 bp, which is reflected in the literature (16). The summary of variants is presented in Table 3, with granular counts by SV length and type found in Supplemental Table S3.
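As an illustration of how such counts by SV type and length can be tabulated, the sketch below reads the `SVTYPE` and `SVLEN` keys from the INFO column of VCF records, keeps the four classes compared here, and bins by absolute length. The bin edges and the toy records are hypothetical and are not the bins or data used for Fig. 2.

```python
# Sketch: count SVs by type and length bin from VCF-style records.
# Assumptions: SVTYPE and SVLEN are present in the INFO column; the bin
# edges and the example records are illustrative only.
from collections import Counter

KEPT_TYPES = {"INS", "DEL", "DUP", "INV"}
BIN_EDGES = [50, 1_000, 10_000, 100_000, 1_000_000]  # hypothetical edges


def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to ''."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def length_bin(length):
    """Return a label for the half-open bin (low, high] containing length."""
    for low, high in zip(BIN_EDGES, BIN_EDGES[1:]):
        if low < length <= high:
            return f"{low + 1}-{high}"
    return f">{BIN_EDGES[-1]}" if length > BIN_EDGES[-1] else f"<={BIN_EDGES[0]}"


def count_by_type_and_bin(vcf_lines):
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        info = parse_info(line.rstrip("\n").split("\t")[7])
        sv_type = info.get("SVTYPE")
        if sv_type not in KEPT_TYPES or "SVLEN" not in info:
            continue
        # Deletions are often written with a negative SVLEN.
        counts[(sv_type, length_bin(abs(int(info["SVLEN"]))))] += 1
    return counts


toy_vcf = [
    "##fileformat=VCFv4.2",
    "chr1\t1000\t1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-300;END=1300",
    "chr1\t5000\t2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=2500;END=5000",
    "chr2\t7000\t3\tN\t<TRA>\t.\tPASS\tSVTYPE=TRA;SVLEN=0",
    "chr3\t900\t4\tN\t<INV>\t.\tPASS\tPRECISE;SVTYPE=INV;SVLEN=150000;END=150900",
]
sv_counts = count_by_type_and_bin(toy_vcf)
print(sv_counts)
```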
a. Differences in variants among alignment files
Deletions (Fig. 2, A, B)
The total number of identified deletions varied widely, from ~6,500 to ~14,300, across alignment files. For all three alignment tools, the nanopore data yielded the most sniffles-identified deletions. Minimap2 preserved the most deletions on both datasets. GraphMap-aligned genomes did not allow detection of any deletions larger than 20 kbp in nanopore data and 6 kbp in SMRT data.
Insertions (Fig. 2, C, D)
The total number of identified insertions also varied (from ~6,000 to ~13,700). In contrast to deletions, the SMRT data yielded the most insertions for all three alignment tools. Minimap2 again allowed identification of the most insertions on both datasets. The three tools agreed for insertions in the 1,000-4,000 bp range. Strikingly, none of the tools allowed identification of insertions larger than 8,000 bp in nanopore sequence and 4,000 bp in SMRT sequence.
Duplications and inversions
GraphMap preserved no breakpoints that sniffles recognized as duplications or inversions
Both minimap2 and NGMLR datasets produced similar numbers of duplications (within 30-50 variants; Fig. 2, E, F) and inversions (within 4-7 variants; Fig. 2, G, H). NGMLR allowed calling of the highest number of duplications and inversions on both datasets, outperforming minimap2 in all but three SV length bins for duplications. This was particularly evident in the 6-10 kbp size range where no duplications were called from the minimap2 datasets (5-10 kbp in nanopore data).
b. Comparison to variants in DGV
Broad trends emerged with comparison to a truthset. Deletions and insertions were overestimated in all data sets compared to the number curated in DGV. Conversely, far fewer duplications and inversions than are present in DGV were called on the LRS data. While there were more indels <10,000 bp present in VCFs than have been curated in DGV, this trend largely reversed above 10,000 bp, where most known variants were missed. DGV variants are shown in red in Fig. 2; details of counts are shown in Supplemental Table S3. While these discrepancies may largely reflect the characteristics of LRS output, since DGV variants are curated from several data sources, our analysis showed that the different alignment tools yielded different callsets from each other as well.
Deletions (Fig. 2A, B)
The overestimation of deletions of all sizes called in the LRS datasets compared to DGV ranged from 4.8 times more in the SMRT/GraphMap data to 10.6 times more in the nanopore/Minimap2 data. For variants smaller than 10 kbp, the alignment files contained more SVs than are present in DGV. Indeed, for both datasets and all three alignment tools, 10 times more SVs between 51-1000 bp were present in VCFs than are present in DGV. Compared to the variants in DGV, minimap2 and NGMLR resolved tens of variants fewer in each bin greater than 20,000 bp but less than 100,000 bp.
Insertions (Fig. 2C, D)
Sniffles identified more insertions ≤ 6,000 bp in nanopore alignments and ≤ 4,000 bp in SMRT alignments than are present in DGV. For both datasets and all three alignment tools, more than 100 times more insertions sized 51-1000 bp are present in VCFs compared to DGV. In contrast, none of the tools were able to identify large insertions beyond 8 kbp or 4 kbp on the nanopore or SMRT data, respectively. DGV contains 934 insertions between 9 and 80 kbp, yet none are resolved by Sniffles on any alignment file from the nanopore or SMRT set.
Duplications (Fig. 2E, F)
None of the tools performed well for identification of duplications. NGMLR resolved at most half as many duplications as are present in DGV, while minimap2 resolved approximately 10%. GraphMap identified none. NGMLR detected ~50 variants larger than 100,000 bp, compared to the 19 variants present in DGV.
Inversions (Fig. 2G, H)
GraphMap preserved no breakpoints Sniffles identified as inversions; minimap2 and NGMLR alignments led to overestimation of called small (51-1,000 bp) inversions. In total, DGV contains more inversions than are present in all VCFs. Less than a quarter of DGV inversions are greater than 100 kbp, compared to around a third of similarly sized variants in the VCFs, regardless of sequencing platform and alignment tool.
The fact that none of the callsets allowed identification of all large SVs of size greater than 10 kbp present in DGV is a significant limitation when trying to use this technology to identify genetic conditions of known SV etiology in patient samples. Since these variants have been curated by the community and have a high likelihood of being truly present in the studied genome, it means that, rather than a specific alignment tool underperforming, the long-read sequence platforms fall short at resolution of these variants. This fact is critical when considering the specifics of diseases where structural variation is a recognized etiology, such as DMD or FSHD, where variants range between tens and hundreds of kbp (14,15), beyond the size of variants accurately detected in this study.
DISCUSSION
In this study we have highlighted the key differences between alignment files generated by three tools that have seen wide adoption for experiments involving the reference-guided alignment of large genomes. By using a well-characterized genome standard, NA12878, we were able to directly compare the performance of each tool on the sequencing datasets obtained on different platforms. We further analyzed the reads that were included or excluded from the alignment, and how these read alignments revealed breakpoints that could be resolved as structural variants in genome architecture. We then compared those variants to a set of previously published variants discovered through multiple platforms.
Reassuringly, each alignment tool was internally consistent: when an alignment tool was given the same fastq and the same reference genome, it produced the same result as judged by bamstats and sniffles variant callsets. However, when looking across the alignment files produced by different tools on the same sequence data, the representations of the genome diverged in terms of which reads were included or excluded, and the numbers and types of variants that were present in VCFs.
This is impactful because of the potentially high value of LRS data for phasing and for identification of epigenetic DNA modification (49). Since the majority of experiments that leverage large scale population surveys can be expected to rely on reference-guided alignment rather than de novo assembly because of both the cost and speed of analysis (50), it is key to understand the idiosyncrasies of each type of alignment file. Furthermore, clinical experiments in genomic medicine face human time constraints; speedier analyses will have higher appeal and adoption.
At this time, GraphMap does not show utility for producing whole genome alignments that include structural variations when run on default parameters. The resource usage was large and the time to complete the computation long with no added benefits to resolve SVs shown in this study. This is not unexpected, as the tool was designed in large part to increase single nucleotide variant sensitivity in noisy nanopore-sequenced reads (41); as such, the algorithm used smooths out some of the complexity of genome architecture. This reduction in complexity holds true for SMRT data as well, removing some of the key benefits of both LRS platforms.
Minimap2 used the least computational resources and ran successfully in the fastest time, an important point, should these platforms be tied closer to bedside applications. On data from both platforms, it allowed calling of the most insertions and deletions, but fell short on inversions and duplications.
NGMLR was the most discerning aligner, in that it left the highest number of reads unmapped. It used more compute power than Minimap2, but handled the SMRT-based data in nearly the same amount of time, while taking considerably more time when handling nanopore-based data. While it was designed specifically to resolve structural variation (39), it calls a high number of very large variants (>100,000 bp) that have not been validated with other methodologies and curated in DGV.
There is a great divergence between the sniffles-called variants from alignment files generated by all tools and the variants present in DGV. This is a concerning expansion of seminal findings in a previous study (16), as none of the sniffles VCFs mirror the SVs present in the high-quality curated DGV database. Small variants are present in sniffles VCFs at a much higher level than in DGV, and, conversely, between 10 and 100 kbp, each alignment tool produces an input for Sniffles that falls short of the number of variants in DGV.
The points above are critical in designing pipelines for genome analysis and structural variant discovery. In computationally unlimited research settings with onsite high-performance clusters, there is value added in the generation of alignment files from both minimap2 and NGMLR. These two perspectives on the same genome will account for some of the inherent differences of each tool and the algorithms they use to handle read alignment (51). If computational resources are limited, minimap2 is the best choice to move the greatest number of genomes through the pipeline quickly; however, the loss of comprehensiveness must be considered in cases where a suspected variant is not found.
This is impactful in genomic medicine. For example, as variants range from tens of kbp in FSHD and hundreds of kbp to Mbp in DMD, diagnosis of these disorders will likely not benefit from data generated on LRS platforms at present, underscoring the need for optical mapping or array-based technologies. However, disorders resulting from smaller SVs such as Huntington’s (~18-540 bp) (52), myotonic dystrophy 1 (~15-153 bp) and myotonic dystrophy 2 (~338-143,000 bp) (53) could be good candidates for deep study with LRS platforms based on the variants present in alignment files from minimap2 and NGMLR. Accordingly, LRS has been used to identify variants in many such disorders (21,54).
If pathogenic loci are known, a high diagnostic yield may be obtained by generating maps with each available alignment tool and using a structural variant caller such as NanoSV (55). Unlike Sniffles, which provides a call of type of SV (deletion, insertion etc.), NanoSV only identifies breakpoints in the alignment, without assigning those breakpoints an SV type. A robust comparison of SV callers on nanopore datasets highlighted the relative strengths of variant calling pipelines and may help users determine the best caller for their experiments (56). NanoSV may be suitable for identifying breakpoints missed by Sniffles, but comes with the further caveat that it is resource-intensive and may not scale in a clinical setting without vast computational resources (56).
Answers to biological questions can also leverage tools that work with data upstream and downstream of whole genome alignments; indeed, alternate solutions to discover differences in genome structure have been proposed that leverage new tools. Tools designed to specifically identify triplet repeat expansions in LRS data sets, such as Repeat Hidden Markov Model (RepeatHMM) or Tandem-Genotypes, have shown the superiority of the approach compared to other techniques (19,20). RepeatHMM uses raw, unaligned reads rather than a whole genome alignment to a reference (19): instead of being aligned, reads can be analyzed independently for microsatellites with specialized tools. The tandem-genotypes tool makes use of a LAST whole-genome sequence alignment to detect copy number changes in the genome (20). In line with this fact, LRS has been used to identify variants in many disorders (21).
The discrepancies between the VCFs generated from alignment files starkly show the need to design experiments with the appreciation that the genomics ecosystem is not dominated by one platform or pipeline and requires a multifaceted approach to discovery. We are at a point where, simply because a breakpoint is missing from the VCF of a long-read genome, we cannot say that it is not present. We must therefore look across platforms and datatypes for comprehensive genome representations (16).
CONCLUSIONS
As the cost of long-read sequencing catches up to that of inexpensive short-read sequencing, the inevitable boom in data production will require well thought-out analysis pipelines. Pipeline design always involves a set of tradeoffs. To accurately assess these tradeoffs, we must have a rigorously benchmarked view into the tools available to create the analytic product. Here, we looked to the differences in reference-guided human genome alignments to understand the difference in each tool’s alignment of the same genome, and how it affects a structural variant callset. This informs our conclusion that, regardless of sequencing platform, when computational resources are not a limiting factor, it should be best practice to align an LRS human genome with both minimap2 and NGMLR to gain better insight into the architecture of a genome of interest.
MATERIALS AND METHODS
1. Tool selection criteria
Exclusion criteria were designed after a comprehensive literature review of alignment tools. Starting with tools that were recommended by the developers of each platform, we examined the tools that were cited or benchmarked against new, platform specific tools. Since this yielded many software tools, we established exclusion criteria to limit our experiments to tools that had seen uptake by the community in terms of citations of that tool, while accounting for the idiosyncrasies of LRS sequencing platforms in their design and implementation. Since tools can be regularly updated or out-versioned, we wanted to use only the most up to date software at the time of analysis.
Inclusion criteria:
Must be cited in more than 100 original papers other than the paper where the tool is initially published
Must be designed with specific parameters for data produced on Oxford Nanopore Technologies and Pacific Biosciences platforms
Must not be superseded by a new tool
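The three criteria above amount to a simple filter over candidate tools. A minimal sketch follows; the `Tool` record, its field names, and the example entries are hypothetical placeholders and do not reproduce the values in Table 1.

```python
# Sketch: apply the three inclusion criteria as a filter.
# The Tool record and the candidate entries are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional

MIN_EXTERNAL_CITATIONS = 100  # "more than 100 original papers"


@dataclass
class Tool:
    name: str
    external_citations: int  # citations other than the tool's own paper
    supports_nanopore: bool
    supports_smrt: bool
    superseded_by: Optional[str] = None


def meets_inclusion_criteria(tool):
    """True only if the tool satisfies all three criteria."""
    return (
        tool.external_citations > MIN_EXTERNAL_CITATIONS
        and tool.supports_nanopore
        and tool.supports_smrt
        and tool.superseded_by is None
    )


candidates = [
    Tool("aligner_a", 250, True, True),               # passes all criteria
    Tool("aligner_b", 10, True, True),                # too few citations
    Tool("aligner_c", 900, True, False),              # no SMRT mode
    Tool("aligner_d", 5000, True, True, "aligner_a"), # superseded
]

included = [t.name for t in candidates if meets_inclusion_criteria(t)]
print(included)
```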
In our examination of structural variant calling pipelines, we elected to limit our study to the variant caller sniffles due to the fact that the output files were more descriptive than those produced by NanoSV. We were unable to install PBHoney on our cluster.
2. Data
Reference genome
GRCh38.p12 was accessed and downloaded on April 8, 2019 from https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/ (57).
Sequence data
Nanopore sequence data were accessed on April 8, 2019, version rel5-guppy-0.3.0-chunk10k.fastq, from AWS Open Data as made available by the Whole Genome Consortium. SMRT sequence data were accessed on May 28, 2019, version sorted_final_merged.bam from the NCBI FTP.
Database of genome variants
DGV data were accessed on January 30, 2020 from http://dgv.tcag.ca/dgv/docs/GRCh38_hg38_variants_2016-05-15.txt (60). The full set of variants was reduced to 11,042 variants confirmed to be present in NA12878. Of these, 14 were excluded because they were called on contigs outside the main reference assembly.
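The contig-based exclusion can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used in this study: the main assembly is taken to be chromosomes 1-22, X, and Y, the record key `chr` is assumed, and the three records are toy examples rather than DGV rows.

```python
# Assumption: "main assembly" means chromosomes 1-22, X, and Y.
MAIN_CONTIGS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def normalize_chrom(name):
    """Strip an optional 'chr' prefix so '1' and 'chr1' compare equal."""
    return name[3:] if name.startswith("chr") else name


def on_main_assembly(chrom):
    """True when the contig label is one of the main chromosomes."""
    return normalize_chrom(chrom) in MAIN_CONTIGS


def split_by_contig(variants):
    """Separate variants on main chromosomes from those on other contigs."""
    kept = [v for v in variants if on_main_assembly(v["chr"])]
    excluded = [v for v in variants if not on_main_assembly(v["chr"])]
    return kept, excluded


# Toy records for illustration only.
toy_variants = [
    {"id": "v1", "chr": "1"},
    {"id": "v2", "chr": "chrX"},
    {"id": "v3", "chr": "chr1_KI270706v1_random"},
]

kept, excluded = split_by_contig(toy_variants)
print(len(kept), "kept;", len(excluded), "excluded")
```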
3. Hardware configuration
Computations were performed on the George Washington University High Performance Compute Center’s Pegasus Cluster on SLURM-managed default queue compute nodes with the following configuration: Dell PowerEdge R740 server with Dual 20-Core 3.70GHz Intel Xeon Gold 6148 processors, 192GB of 2666MHz DDR4 ECC Registered DRAM, 800 GB SSD onboard storage (used for boot and local scratch space), and Mellanox EDR InfiniBand controller to 100GB fabric.
4. Whole genome alignment tool benchmarking and analysis
The versions of alignment tools used in this project were GraphMap/0.5.2, minimap2/2.16, and ngmlr/0.2.7.
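For orientation, the sketch below builds a typical command line for each aligner without running it. The exact parameters used in this study are not stated in this section, so the presets shown (`map-ont` and `map-pb` for minimap2, `-x ont` and `-x pacbio` for ngmlr) are assumptions based on each tool's documented platform options, and the file names are hypothetical.

```python
def alignment_command(tool, platform, reference, reads, out_sam):
    """Return an argument list for one aligner; nothing is executed here.

    platform must be "ont" or "pacbio". The presets are assumptions, not the
    settings confirmed for this study.
    """
    if platform not in ("ont", "pacbio"):
        raise ValueError(f"unknown platform: {platform}")
    if tool == "minimap2":
        preset = "map-ont" if platform == "ont" else "map-pb"
        # -a requests SAM output, which minimap2 writes to standard output,
        # so out_sam is not part of the argument list for this tool.
        return ["minimap2", "-a", "-x", preset, reference, reads]
    if tool == "ngmlr":
        return ["ngmlr", "-r", reference, "-q", reads, "-o", out_sam, "-x", platform]
    if tool == "graphmap":
        # No platform preset is passed for GraphMap in this sketch.
        return ["graphmap", "align", "-r", reference, "-d", reads, "-o", out_sam]
    raise ValueError(f"unknown tool: {tool}")


# Hypothetical file names for illustration only.
for tool in ("graphmap", "minimap2", "ngmlr"):
    print(alignment_command(tool, "ont", "GRCh38.fa", "reads.fastq", f"{tool}.sam"))
```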
Computational metrics were printed from SLURM job records. Samtools 1.9 was used for all alignment manipulations, alignment read depth coverage calculations, and extraction of unmapped reads (61). Samtools 1.10 was used to generate bamstats files. R 3.5.2 ("Eggshell Igloo") with the tidyverse packages was used to compare read names across unmapped-read files and assess the degree of overlap of unmapped reads across alignment files. Genome coverage was calculated with samtools 1.9 assuming a 3 billion basepair genome. Binary conversion factors were used to convert bytes (1,073,741,824 per gigabyte) and kilobytes (1,048,576 per gigabyte) to gigabytes. CPU and wall clock hours were calculated from CPU and wall clock seconds.
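The arithmetic described above can be illustrated with a short Python sketch. The constants come from the text. The function names, the assumption that coverage is total aligned bases divided by genome size, and the toy read names are illustrative, and the set comparison stands in for the R/tidyverse comparison used in the study.

```python
GENOME_SIZE_BP = 3_000_000_000   # genome size assumed in the text
BYTES_PER_GB = 1_073_741_824     # 1024 ** 3
KB_PER_GB = 1_048_576            # 1024 ** 2
SECONDS_PER_HOUR = 3600


def mean_coverage(total_aligned_bases, genome_size=GENOME_SIZE_BP):
    """Mean depth, assumed here to be aligned bases divided by genome size."""
    return total_aligned_bases / genome_size


def bytes_to_gb(n_bytes):
    return n_bytes / BYTES_PER_GB


def kb_to_gb(n_kb):
    return n_kb / KB_PER_GB


def seconds_to_hours(seconds):
    return seconds / SECONDS_PER_HOUR


def unmapped_overlap(read_names_by_tool):
    """Return (reads unmapped by every tool, reads unmapped by one tool only)."""
    sets = {tool: set(names) for tool, names in read_names_by_tool.items()}
    shared = set.intersection(*sets.values())
    unique = {}
    for tool, names in sets.items():
        others = set().union(*(s for t, s in sets.items() if t != tool))
        unique[tool] = names - others
    return shared, unique


# Toy read names for illustration only.
shared, unique = unmapped_overlap({
    "graphmap": ["r1", "r2", "r3"],
    "minimap2": ["r2", "r3", "r4"],
    "ngmlr": ["r3", "r5"],
})
print(sorted(shared), {tool: sorted(names) for tool, names in unique.items()})
```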
5. Data reshaping and visualization
As data were integrated across multiple sources and formats, they needed to be reshaped for comparison and visualization. For example, the multiple separators found in VCFs (commas and semicolons) are not directly usable in R or Python. Relevant dataframes from genome alignment files were reshaped with shell scripts and R scripts whose methodology and key intermediate files are available as supplemental files to this manuscript.
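To illustrate the separator problem, the sketch below splits a VCF INFO string first on semicolons and then on commas. It is not the reshaping script from the supplement. The function name `parse_info` and the example string are hypothetical, although the keys shown (SVTYPE, SVLEN, CIPOS) are commonly seen in structural variant VCFs.

```python
def parse_info(info):
    """Split a VCF INFO string into a dictionary.

    Entries are separated by semicolons. A key without '=' is a flag and is
    stored as True. A value containing commas becomes a list of strings.
    """
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing or doubled semicolon
        key, sep, value = item.partition("=")
        if not sep:
            fields[key] = True
        elif "," in value:
            fields[key] = value.split(",")
        else:
            fields[key] = value
    return fields


# Toy INFO string for illustration only.
record = parse_info("PRECISE;SVTYPE=DEL;SVLEN=-1500;CIPOS=-10,10")
print(record)
```

Values are kept as strings so that the caller decides which fields to convert to numbers, for example `int(record["SVLEN"])`.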
6. Comparison of breakpoints
Structural variants (SVs) were called with sniffles/1.0.11 with default parameters. Only variants called on the main GRCh38 assembly were included. R 3.5.2 was used to bin and graph structural variants by SV Type with tidyverse packages and cutr (62,63). Intermediary files and scripts are available upon request. Data from sniffles VCFs and from DGV were subset by comparable fields, including SV Type, SV Length, and Chromosome, because a common format was unavailable for direct comparison. Figures were made with R packages tidyverse/ggplot and VennDiagram (64).
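The binning step can be sketched in Python as below. The study performed it in R with cutr, so this is an equivalent illustration rather than the original code. The power-of-ten bin edges are an assumption chosen to line up with the size ranges discussed in the text, and the calls are toy values.

```python
from bisect import bisect_right
from collections import Counter

# Assumed bin edges in basepairs; each bin includes its lower edge.
BIN_EDGES = [100, 1_000, 10_000, 100_000, 1_000_000]
BIN_LABELS = [
    "<100",
    "100-1,000",
    "1,000-10,000",
    "10,000-100,000",
    "100,000-1,000,000",
    ">=1,000,000",
]


def size_bin(sv_length):
    """Map an SV length to its size bin, using the absolute length."""
    # Deletions are often reported with negative lengths, hence abs().
    return BIN_LABELS[bisect_right(BIN_EDGES, abs(sv_length))]


def count_by_type_and_bin(calls):
    """Count calls per (SV type, size bin) from (sv_type, sv_length) pairs."""
    return Counter((sv_type, size_bin(length)) for sv_type, length in calls)


# Toy calls for illustration only.
toy_calls = [("DEL", -1500), ("DEL", -2500), ("INS", 50_000), ("DUP", 2_000_000)]
counts = count_by_type_and_bin(toy_calls)
for (sv_type, label), n in sorted(counts.items()):
    print(sv_type, label, n)
```

Applying the same function to sniffles calls and to DGV records puts both sources on identical bins, which is what makes counts per SV type and size range comparable.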
DECLARATIONS
Ethics approval and consent to participate
N/A
Consent for publication
N/A
AVAILABILITY OF DATA AND MATERIALS
All data and tools used for this study are publicly available at the URLs indicated above in Methods. The scripts we used to process those data are available in supplementary materials as S4. Raw and post-processed VCFs, DGV source material, and other dataframes used to create the figures are available at: https://drive.google.com/drive/folders/1EImtHkYpOoUk7LOqPE94r6CZSxk_fw?usp=sharing (65)
Interested parties can further reach out to the authors for intermediate BAMs if needed.
COMPETING INTERESTS
EV is on the board of scientific advisors to Bionano Genomics.
FUNDING
JL, ED, and EV are supported in part by A. James Clark Family Fund and Distinguished Professorship in Molecular Genetics at Children’s National Hospital. ED and EV are supported by NIH Award Number U01HG011745 from the National Human Genome Research Institute.
AUTHORS' CONTRIBUTIONS
JL conceived of the study, performed the experiments, and wrote the manuscript. ED gave comment and direction on experimental design, and edited the manuscript. EV gave comment and direction on experimental design, edited the manuscript, and allocated laboratory resources to the project.
SUPPLEMENTAL TABLE OF CONTENTS
S1: Table of computational metrics
S2: Table of samtools bamstats outputs
S3: Table of structural variant calls by size
S4: Supplemental methods: scripts and methodology
ACKNOWLEDGEMENTS
Adam Kai Leung Wong, PhD, High Performance Computing Specialist for Genomics at the GWU Computational Biology Institute, provided critical support in preparing the HPC for this study.
Footnotes
jlotempio{at}gwu.edu, edelot{at}gwu.edu, evilain{at}gwu.edu
https://drive.google.com/drive/folders/1EImtHkYpOoUk7LOqPE__94r6CZSxk_fw?usp=sharing