HQAlign: Aligning nanopore reads for SV detection using current-level modeling

Dhaivat Joshi; Suhas Diggavi; Mark J.P. Chaisson; Sreeram Kannan

doi:10.1101/2023.01.08.523172

Abstract

Motivation Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with accurate alignment of these reads, play an important role in identifying novel SVs. Long read sequencers such as nanopore sequencers can address this problem by providing very long reads, but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing are biased because of the physics of the sequencing process, and proper use of these error characteristics can play an important role in designing a robust aligner for SV detection. In this paper, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using basecalled nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes into the alignment pipeline, and (iii) adapting these into the pipeline of an existing state-of-the-art long read aligner, minimap2 (v2.24), for efficient alignment.

Results We show that HQAlign captures about 4-6% complementary SVs across different datasets that are missed by minimap2 alignments, while having a standalone performance on par with minimap2 for real nanopore read data. For the SV calls common to HQAlign and minimap2, HQAlign improves the start and end breakpoint accuracy for about 10-50% of SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from 85.64% with minimap2 for nanopore read alignment to the recent telomere-to-telomere CHM13 assembly, and to 86.65% from 83.48% for nanopore read alignment to the GRCh37 human genome.

Availability https://github.com/joshidhaivat/HQAlign.git

1 Introduction

Structural variations (SVs) are genomic alterations at least 50 bp long, including insertions, deletions, inversions, duplications, translocations, or a combination of these types [1]. The study of these genetic variations has an important role in understanding human diseases, including cancer [2], and begins with alignment of sequences from the sample back to the reference genome. Accurate alignment of short reads from high throughput sequencing is challenging, especially in the repetitive regions of the genome, which are also the hotspots of nearly 70% of the observed structural variations [3].

Long read sequencing technologies have addressed this problem by producing reads that are longer than the repeat regions, thereby enabling the detection of variants in the repeat regions at the cost of higher error rates than short read sequencing technologies. These high error rates in the long reads lead to non-contiguous alignments, which pose a challenge for variant detection, especially in the repeat regions.

Nanopore sequencing [4, 5] is a long read sequencing technology that provides reads (with an average read length of 10 kb and the longest sequenced read more than 2 Mb long) that can span these repetitive regions, but it has a high average error rate of 10%. These high error rates result in low accuracy alignments [6] using state-of-the-art methods including minimap2 (v2.24) [7], which is a fast method designed for the computationally challenging task of long sequence alignment. This problem is further amplified in repetitive regions such as variable-number tandem repeat (VNTR) regions, which account for a significant fraction of SVs [8, 9]. However, these errors in nanopore sequencing have a bias induced by nanopore physics, which is missed by many long read aligners since they treat the errors as independent insertions, deletions, and substitutions. In nanopore sequencing, a DNA strand migrates through the nanopore, and an ionic current is established according to the nucleotide sequence in or near the nanopore. However, because of the physics and non-idealities of nanopore sequencing, each recorded current level depends on a Q-mer (a set of Q consecutive nucleotide bases that influence the measurement in the nanopore) [10, 11, 12]. These current readings are translated back to nucleotide sequences by basecalling algorithms. Therefore, error biases can be introduced in basecalling, especially between different Q-mers that have similar current levels. This similarity in the median current levels of different Q-mers is captured by the Q-mer map shown in Figure 1b. A Q-mer map represents the median current level for different Q-mers (Q = 6) for a nanopore flow cell. It is evident from this figure that there is a significant overlap between the current levels observed for different Q-mers migrating through the nanopore.

We propose a new alignment method, HQAlign (which is based on QAlign [13]), designed specifically for detecting SVs while incorporating the error biases inherent in the nanopore sequencing process. The HQAlign pipeline is modified to enable detection of inversion variants, which was not feasible with the earlier QAlign pipeline (refer to methods section 2.2 for details).

Figure 1:

(a) An example illustrating the error biases in nanopore basecalled reads, which can be resolved through the Q-mer map, and the ability of HQAlign to perform accurate alignment despite the errors (the edit distance used here is domain specific and is used to demonstrate the accuracy of the alignment). (b) Q-mer map for the Nanopore R9.4 1D flow cell (for Q = 6). It represents the physics of the nanopore. The median current value, along with the standard deviation (as error bars), is plotted for all 6-mers in the Q-mer map for the R9.4 1D nanopore flow cell (the Q-mers are sorted in increasing order of median current level). Note that the difference between the median current levels of any two consecutive Q-mers is very small, which results in large overlaps. (c) An example from PromethION R9.4.1 ONT data in the neighborhood of an SV in a repeat region, showing that the two different nucleotide sequences have similar current levels and, therefore, that the edit distance as observed through the lens of the quantized sequence is significantly lower in HQ3.

HQAlign takes the dependence on the Q-mer map into account to perform accurate alignment, with modifications made specifically for the discovery of SVs. Figure 1a gives an example where a DNA sequence (GCATGACAGG) is sequenced incorrectly as (CGGCAACCGA) due to the error bias in the nanopore sequencer. The sequences are therefore different in the nucleotide space, but they are identical in the Q-mer map space. It is important to note that no additional soft information, such as raw nanopore current values for the nanopore reads, is used to establish this identity. Instead, the nucleotide sequences that have indistinguishable current levels through the lens of the Q-mer map are mapped to a common quantized sequence. A nucleotide sequence is converted to a quantized sequence by first converting it to a sequence of current levels using the Q-mer map, and then converting the sequence of (continuous) current levels to a (finite level) quantized sequence by hard thresholding the current levels (refer to Supplementary material section 1.1 for more details). Therefore, the quantization process does not use additional information about the raw current signals; only the Q-mer map is used. This process is explained in detail in Supplementary Figure 1. Further, the quantization of continuous current levels to a finite number of discrete levels enables the use of existing software pipelines of state-of-the-art long read aligners such as minimap2 as the core seed-and-extend algorithm for the alignment of quantized sequences.
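The two-stage conversion described above (nucleotides to current levels, then current levels to discrete symbols) can be sketched as follows. This is a minimal illustration, not the HQAlign implementation: the Q-mer table built by `toy_qmer_map` is made up (a real Q-mer map comes from the nanopore flow cell model), Q = 3 is used only to keep the table small (the text uses Q = 6), and the two thresholds are hypothetical values chosen to give three levels.

```python
from itertools import product


def toy_qmer_map(q):
    """Build a made-up Q-mer -> median current table (NOT a real nanopore model)."""
    weight = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
    table = {}
    for bases in product("ACGT", repeat=q):
        qmer = "".join(bases)
        # Toy rule: the current grows with the mean base weight of the Q-mer.
        table[qmer] = 80.0 + 10.0 * sum(weight[b] for b in qmer) / q
    return table


def to_current_levels(seq, qmer_map, q):
    """Slide a window of length q over seq and look up each Q-mer's current level."""
    return [qmer_map[seq[i:i + q]] for i in range(len(seq) - q + 1)]


def quantize(levels, thresholds):
    """Hard thresholding: the symbol is the number of thresholds at or below the level."""
    return [sum(level >= t for t in thresholds) for level in levels]


def quantize_sequence(seq, qmer_map, q, thresholds):
    """Nucleotide sequence -> current levels -> finite-level quantized sequence."""
    return quantize(to_current_levels(seq, qmer_map, q), thresholds)


if __name__ == "__main__":
    q = 3                           # toy value; the text uses Q = 6
    thresholds = (90.0, 100.0)      # hypothetical cut points -> symbols 0, 1, 2
    qmap = toy_qmer_map(q)
    # Two different nucleotide strings can share one quantized sequence.
    print(quantize_sequence("AACA", qmap, q, thresholds))
    print(quantize_sequence("ACAA", qmap, q, thresholds))
```

Only the lookup table and the thresholds are needed, which mirrors the point made in the text that no raw current signal is consulted.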

In HQAlign, we first align the reads onto the genome using minimap2 to determine the region of interest to which a read can possibly align, and then re-align the quantized read to the quantized genome region from the first step. This helps in accurately aligning the read to the genome region without dropping the frequently occurring seed matches from the chain in the minimap2 algorithm, while taking the error biases of nanopore sequencing into account through quantized sequences. Moreover, the HQAlign pipeline enables detection of inversion variants, unlike the QAlign pipeline. In QAlign, the quantized reverse complement of the read is aligned separately to the quantized genome; therefore, the alignment of an inverted sequence is not observed in QAlign (as shown in Figure 2a). In HQAlign, however, we have modified the minimap2 pipeline to align the reverse complement of the quantized read simultaneously with the forward quantized read sequence to the quantized genome. This is necessary for detecting inverted alignments using quantized sequences because, unlike nucleotide sequences, where minimap2 can inherently produce the reverse complement of the input nucleotide query, the reverse complement of the quantized sequence has to be computed separately. Further, HQAlign is about 2x faster than QAlign, as the seed search domain for the quantized sequences is reduced to the region of interest determined in the first step of the pipeline (as shown in Table 1).
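The need to compute the reverse-complement query separately can be sketched as follows: the nucleotide read is reverse complemented first, and each orientation is then quantized on its own. This is an illustrative sketch only. `toy_quantize` is a hypothetical stand-in for a Q-mer-map-based quantizer (2-mers, three levels), and the function names are not taken from the HQAlign code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_quantize(seq):
    """Hypothetical quantizer: map each 2-mer to a level in {0, 1, 2}.

    The rule below is made up for illustration and is NOT a nanopore model.
    """
    weight = {"A": 0, "C": 1, "G": 2, "T": 3}
    levels = []
    for i in range(len(seq) - 1):
        value = 2 * weight[seq[i]] + weight[seq[i + 1]]  # ranges over 0..9
        levels.append(value // 4)                        # three levels: 0, 1, 2
    return levels


def quantized_queries(read, quantize):
    """Return the forward and reverse-complement quantized queries for one read.

    The reverse-complement query is derived from the nucleotide read because
    complementing bases is only defined in nucleotide space.
    """
    return quantize(read), quantize(reverse_complement(read))


if __name__ == "__main__":
    forward, reverse = quantized_queries("AAGC", toy_quantize)
    print(forward, reverse)
```

Both queries would then be offered to the aligner together, so that a read whose segments map to opposite strands can still produce an inverted alignment.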

Table 1:

Comparison of computation time for alignment of 500k randomly sampled ONT reads to CHM13 assembly using 20 threads for each method.

Figure 2:

(a) An example demonstrating the ability of the HQAlign pipeline to align inverted sequences where QAlign fails. (b) An example of the HQAlign pipeline. (c) An example of read-to-genome alignment. (d) Comparison of an SV in the truth set to the SV determined by a method: minimap2/HQ3.

Figure 1c shows an example from real ONT read data in a repeat region (note that a pattern of a few consecutive nucleotide bases is repeated in the example) flanking an insertion structural variant. The minimap2 alignment of the nucleotide reference and read (both of length 356, from the region highlighted with a box) has an edit distance of 66, whereas the HQ3 alignment of the quantized reference and read sequences from the same region has a significantly smaller edit distance of 7 (HQ3 is an alignment from the HQAlign pipeline in which the nucleotide sequences are translated to three-level quantized sequences; refer to section 2.2 for details). This is because the current level sequences of the reference and the read (obtained by converting the nucleotide sequences using the Q-mer map in Figure 1b) are very similar. Therefore, sequences that are far apart in nucleotide space are inherently very similar in the HQ3 space, as measured by the edit distance in the transformed space.
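The comparison above relies on the edit distance being computable both on nucleotide strings and on quantized symbol sequences. A minimal sketch of a standard Levenshtein distance that accepts either kind of sequence is given below; it is a generic textbook routine and is not claimed to be the exact, domain-specific distance used for the figure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of symbols).

    Unit cost for insertion, deletion, and substitution; O(len(a) * len(b)) time
    and O(len(b)) memory.
    """
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Works on nucleotide strings ...
    print(edit_distance("GCATGACAGG", "CGGCAACCGA"))
    # ... and on quantized sequences represented as lists of integer levels.
    print(edit_distance([0, 1, 2, 2, 1], [0, 1, 2, 1]))
```

Applying the same routine to a sequence pair before and after quantization gives the kind of nucleotide-space versus quantized-space comparison reported for Figure 1c.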

We show that HQAlign gives significant improvements in the quality of read alignment across real and simulated data. The percentage of well-aligned reads (a read is defined as well aligned if at least 90% of the read is aligned to the genome with a mapping quality of more than 20) improves to 86.65% with HQ3 from 83.48% with minimap2 (v2.24) for the alignment of ONT reads from the HG002 sample to the GRCh37 human genome. The metric improves to 89.35% from 85.64% for the alignment of HG002 reads to the T2T CHM13 assembly [14], and to 81.57% from 81.01% for the simulated read data. These results are presented in results section 3.1.2, Table 2.
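The well-aligned criterion stated above translates directly into a small predicate. The sketch below assumes each alignment is summarized by three numbers (aligned bases, read length, mapping quality); this record layout and the function names are illustrative assumptions, not the format used by HQAlign.

```python
def is_well_aligned(aligned_bases, read_length, mapq,
                    min_fraction=0.90, min_mapq=20):
    """True if at least 90% of the read is aligned with mapping quality above 20."""
    if read_length <= 0:
        return False
    return aligned_bases / read_length >= min_fraction and mapq > min_mapq


def percent_well_aligned(alignments):
    """Percentage of well-aligned reads in a list of (aligned_bases, read_length, mapq)."""
    if not alignments:
        return 0.0
    hits = sum(is_well_aligned(a, n, q) for a, n, q in alignments)
    return 100.0 * hits / len(alignments)


if __name__ == "__main__":
    # Hypothetical records, one per read: (aligned bases, read length, MAPQ).
    reads = [(9500, 10000, 60), (9000, 10000, 20), (4000, 10000, 60), (900, 1000, 21)]
    print(percent_well_aligned(reads))
```

Note the asymmetry in the definition: the aligned fraction uses "at least" (inclusive), while the mapping quality uses "more than" (strict).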

Table 2:

Comparison of the percentage of reads well aligned to the genome, and of the slope of the regression line (for the normalized edit distance comparison plot of HQ3 vs minimap2 alignments), with randomly sampled reads for each dataset. The slope of the regression line shows the average gain in the normalized edit distance.

In terms of SV detection, HQAlign has an F1 score on par with minimap2 (v2.24), using Sniffles2 [15] as the variant calling algorithm, across both real and simulated datasets (Table 3). However, HQAlign and minimap2 each capture many complementary calls (4-6%) that are missed by the other method (as shown in Figures 6, 7, 8, 9, and 10). For instance, the complementary HQAlign calls are SVs that are uniquely called by HQAlign or labeled as missed in minimap2 due to a break in the SV, and vice versa for the complementary calls in minimap2. Further, the analysis of common true positive SV calls in HQAlign and minimap2 against the truth set shows that HQAlign has, on average, significantly better breakpoint accuracy than minimap2 (10-50%, from the slope of the regression line in Figures 11, 12, 13, and 14, and weighted average across all datasets for 39% of SVs) for the calls with a breakpoint difference greater than 50 bp. Breakpoint accuracy is determined from the difference in the start and end breakpoints of an SV with respect to the matching SV in the truth set; therefore, the lower the difference, the higher the breakpoint accuracy (refer to section 2.3 for a precise definition). Moreover, for the common true positive calls, HQAlign has (on average) better SV length similarity than minimap2 when SV length similarity is less than 0.95, as shown in Figures 11, 12, 13, and 14. SV length similarity is a measure of how similar the length of an SV from an alignment method is to that of the matching SV in the truth set (refer to section 2.3 for a precise definition).
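The evaluation quantities named above can be sketched as follows. The F1 score is the standard harmonic mean of precision and recall. The breakpoint difference and the length similarity are assumptions made for illustration, since the precise definitions are given in section 2.3: here the breakpoint difference is taken as the absolute offset of each breakpoint from the matched truth SV, and the length similarity as the ratio of the shorter to the longer SV length.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def breakpoint_differences(call_start, call_end, truth_start, truth_end):
    """ASSUMPTION: absolute start and end offsets (in bp) from the matched truth SV."""
    return abs(call_start - truth_start), abs(call_end - truth_end)


def length_similarity(call_length, truth_length):
    """ASSUMPTION: ratio of the shorter to the longer SV length, in [0, 1]."""
    shorter, longer = sorted((abs(call_length), abs(truth_length)))
    if longer == 0:
        return 1.0
    return shorter / longer


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(f1_score(0.9, 0.8))
    print(breakpoint_differences(10_050, 10_400, 10_000, 10_420))
    print(length_similarity(350, 420))
```

Under these assumed definitions, a smaller breakpoint difference and a length similarity closer to 1 both indicate a call that agrees more closely with the truth set.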

Table 3:

Comparison of precision, recall, and F1 score for SV calls made by HQ3, minimap2, and the Union model.

Figure 3: HG002 nanopore long DNA reads alignment onto T2T CHM13 genome.

(a) Comparison of normalized edit distance for HG002 R9.4.1 PromethION reads data. Smaller values of normalized edit distance are desirable, as they represent better alignment. The slope of the regression line is 0.79 < 1, which indicates that, on average, HQ3 alignments are better than minimap2 alignments for the same reads. (b) Comparison of normalized alignment length for HG002 R9.4.1 PromethION reads data. A normalized alignment length of 1 is desirable, as it means that the entire read is aligned. The majority of the reads are above the y = x line, representing longer alignment lengths in HQ3 than in minimap2 alignments.

Figure 4: HG002 nanopore long DNA reads alignment onto GRCh37 genome.

(a) Comparison of normalized edit distance for HG002 R9.4.1 PromethION reads data. Smaller values of normalized edit distance are desirable, as they represent better alignment. The slope of the regression line is 0.82 < 1, which indicates that, on average, HQ3 alignments are better than minimap2 alignments for the same reads. (b) Comparison of normalized alignment length for HG002 R9.4.1 PromethION reads data. A normalized alignment length of 1 is desirable, as it means that the entire read is aligned. The majority of the reads are above the y = x line, representing longer alignment lengths in HQ3 than in minimap2 alignments.

Figure 5: Simulated nanopore reads alignment onto T2T CHM13 genome.

(a) Comparison of normalized edit distance for simulated nanopore reads data. Smaller values of normalized edit distance are desirable, as they represent better alignment. The slope of the regression line is 0.99 < 1, which indicates that, on average, HQ3 alignments are marginally better than minimap2 alignments for the same reads. (b) Comparison of normalized alignment length for simulated nanopore reads data. A normalized alignment length of 1 is desirable, as it represents a contiguous alignment of the entire read.