HELLO: A hybrid variant calling approach

Next Generation Sequencing (NGS) technologies that cost-effectively characterize genomic regions and identify sequence variations using short reads are the current standard for genome sequencing. However, calling small indels in low-complexity regions of the genome using NGS is challenging. Recent advances in Third Generation Sequencing (TGS) provide long reads, which enable accurate calling of large structural variants. However, these reads have context-dependent indel errors in low-complexity regions, resulting in lower accuracy of small indel calls compared to NGS reads. When both small and large structural variants need to be called, both NGS and TGS reads may be available. Integrating the two data types, with their distinct error profiles, could improve the robustness of small variant calling in challenging cases. However, no existing method integrates both types of data. We present a novel method that integrates NGS and TGS reads to call small variants. We leverage the Mixture of Experts paradigm, which uses an ensemble of Deep Neural Networks (DNNs), each processing a different data type, to make predictions. We present improvements in our DNN design compared to previous work, such as sequence processing using one-dimensional convolutions instead of image processing using two-dimensional convolutions, and an algorithm to efficiently process sites with many variant candidates, both of which reduce computation. Using our method to integrate Illumina and PacBio reads, we find a reduction in the number of erroneous small variant calls of up to ~30% compared to the state-of-the-art using only Illumina data. We also find improvements in calling small indels in low-complexity regions.


Introduction
Short-read sequencing or Next Generation Sequencing (NGS) technology from Illumina produces highly accurate reads, typically between 50 and 150 bases long. The high accuracy is attractive for calling variants, including substitutions and small indels (indel length ≤ 50bp). However, some inherent challenges remain, especially in calling small indels in low-complexity regions, where short reads may have difficulty unambiguously characterizing variations. In our experiments, we find that over 88% of erroneous indel calls made by a current state-of-the-art variant caller occur in the low-complexity regions of the genome.
Third Generation Sequencing (TGS) technology can produce sequencing reads that are thousands of bases long. Recent advances in PacBio Circular Consensus Sequencing (CCS) technology have improved the consensus base calling accuracy significantly. However, the remaining errors are predominantly small indels, and these errors are highly context-dependent and systematic, making it difficult for a variant caller to reach the right consensus from multiple reads covering such locations. Hence, while TGS has been shown to call substitutions and large structural variants at high accuracy, the systematic errors cause small indel calling accuracy to be worse than that achieved using NGS reads (Wenger, Peluso, & Hunkapiller, 2019).
Both NGS and TGS reads have strengths and weaknesses, which make them suited for calling different variant types. NGS reads provide accurate small variant calls, while TGS reads provide accurate structural variant calls. Both types of variants may be relevant to different clinical applications, such as individualized medicine, for example, due to their role in influencing an individual's susceptibility to diseases (Deng, Zhou, Fan, & Yuan, 2017) (Guo, et al., 2018). To call all types of variants, sequencing data from both sources may need to be collected. Further, we speculate that if the sources of errors from the two platforms are in some sense complementary, it would be desirable to have a method that can take advantage of both data types to improve the accuracy of variant calling. Currently available methods for calling small variants use only one of the two types of sequencing data, which is a limitation. Hence, using currently available tools, even when both types of sequencing data are available, our only option is to call variants from one type of data exclusively.
Small variants are the most numerous types of variants in the genome. Both substitutions and small indels are widely associated with increased risk of many diseases, such as different types of cancer (Stacey, et al., 2007) (Dai, et al., 2019) (Carlo, et al., 2020). Due to the complex nature of some of the associations, the role of variants is still being studied. Given that small variant callers today use only one type of sequencing data, and as a result consistently make erroneous calls in certain types of regions (e.g., indel calls in low-complexity regions) due to the error modes characteristic of a single sequencing technology, the importance of variants in such regions is likely less well-understood today. In addition, currently accepted benchmarks for variant calling such as Genome-In-A-Bottle (Zook, et al., 2019) have uncharacterized regions in the genome which may carry variants of significance. Some of these regions cannot be characterized due to the reliance, solely, on one type of sequencing data (namely short reads). Hence, a variant calling method using both short and long reads has the following potential advantages: (i) a general reduction of errors in small variant calling, which is important from a clinical and research point of view, by leveraging long-read sequences that may have been obtained for a different purpose; (ii) a reduction of errors in currently benchmarked but challenging regions of the genome, improving our utilization of such regions in research and clinical settings; and (iii) enabling future studies into regions of the genome that are not currently benchmarked because current benchmarking efforts rely on only one type of sequencing data.
Deep Neural Networks (DNNs) have recently become the preferred method for pattern recognition tasks and have been applied successfully to the problem of small variant calling (Poplin, et al., 2018). This was enabled by the availability of high-quality, benchmarked truth-sets from efforts such as the Genome-In-A-Bottle (GIAB) and Platinum Genomes (Eberle, et al., 2016). When designing a system with DNNs, the underlying patterns are discovered by the DNN from training data rather than expressed manually through models and algorithms by the method developer. DNNs can learn complex patterns from large amounts of data that would be infeasible for a method developer to examine manually to cover all underlying cases. In theory, DNNs can represent true underlying patterns in data due to their ability to operate as universal function approximators (Hornik, Stinchcombe, & White, 1989).
DNNs also provide mechanisms to combine data from multiple platforms. The canonical method to combine data for analysis is straightforward. A sufficiently large neural network, in theory, is able to represent any function f(X1, X2) = Y based on its universal approximation ability, where X1 and X2 can be sequencing data from two different sequencing platforms and Y, a set of variant calls. However, more sophisticated techniques can be designed which are more efficient in terms of computation, as well as more effective in terms of learning ability.
We have developed a caller, Hybrid Evaluation of smaLL genOmic variants (HELLO), which performs variant calling by integrating sequencing data from Illumina and PacBio reads. The method is based on the Mixture-of-Experts paradigm (Jordan & Jacobs, 1993) which allows multiple experts to make independent predictions, and a switch, or meta-expert to choose one of the experts' predictions. Each expert in our implementation is itself a DNN specialized to the error profile of a different type of data (Illumina, PacBio, or both). The meta-expert is a DNN that learns what types of genomic regions are suited for each expert. In addition, each expert and meta-expert uses specialized architectures tailored for genome sequencing data, which allows us to reduce the computational complexity of the analysis.
Applying our method to a whole-genome case study for which both types of sequencing data are available, we find that it outperforms the state-of-the-art variant caller that uses only Illumina reads.
Our analysis also shows that the method improves the accuracy of calling small indels in low-complexity regions of the genome, which are harder to call using either Illumina or PacBio reads alone, indicating that HELLO may counter the errors of the individual sequencing platforms to some extent because the two platforms have different error characteristics.

Multi-label classification
In the typical germline paradigm, human DNA consists of two haplotypes which, at any site in the genome, may be represented by a single unique sequence of bases or by two different sequences; that is, there are one or two unique true alleles per site. Based on read alignment data at a given site, one may infer candidate alleles at the site. Some candidate alleles may arise from errors in sequencing and/or alignment. Hence, the number of candidate alleles and true alleles can vary from site to site. A variant caller needs to select one or two alleles as true alleles from a set of candidate alleles of variable cardinality. In machine learning terms, a model used for variant calling should be capable of labeling one or two categories as positive (as true alleles), from a variable number of categories (candidate alleles). This task differs from most popular DNN-based classification tasks (e.g., ImageNet image classification), which select a single positive category per input, from among a fixed number of categories.
Predicting multiple labels as opposed to a single label is called multi-label classification.
In our paper, we use the canonical method to solve multi-label classification: the independent prediction of the presence or absence of each category, otherwise called the binary-relevance method (Boutell, Luo, Shen, & Brown, 2004). In terms of genome sequencing, this implies separately predicting the presence or absence of each candidate allele at a locus in the sequenced genome. The steps involved in predicting the ground-truth allele at a site are presented in Figure 1.
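The binary-relevance scheme described above can be sketched in a few lines: each candidate allele receives an independent probability, and up to two alleles (one per haplotype) are reported. The function names, scores, and threshold below are illustrative, not taken from HELLO's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def call_alleles(allele_scores, threshold=0.5, max_alleles=2):
    """Binary-relevance multi-label calling: each candidate allele gets an
    independent probability; up to two alleles (diploid genome) exceeding
    the threshold are reported as the true alleles."""
    probs = {allele: sigmoid(score) for allele, score in allele_scores.items()}
    passing = sorted(
        (a for a, p in probs.items() if p >= threshold),
        key=lambda a: probs[a],
        reverse=True,
    )
    return passing[:max_alleles]

# Scores stand in for per-allele DNN outputs (logits); values are made up.
scores = {"A": 4.2, "ACT": 1.1, "AT": -3.0}
print(call_alleles(scores))  # heterozygous call: ['A', 'ACT']
```

Note that, unlike a softmax over a fixed set of genotypes, this formulation handles a variable number of candidate alleles per site without changing the model's output layer.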

Mixture of Experts
The core computations in our method are based on a DNN trained using the Mixture-of-Experts (MoE) framework for ensemble learning. In the MoE framework, there are multiple experts or sub-DNNs (henceforth simply referred to as "DNNs"), each providing predictions for a given input. A switch then selects from the predictions of the multiple experts according to a probability distribution over the experts, which is produced from the input data by another DNN that we refer to as the meta-expert DNN. During training, we try to improve the likelihood of making the correct predictions using the MoE. This involves improving not only the predictive power of each expert, but also the predictive power of the meta-expert DNN, which essentially determines which expert is the most suited for a given input.
The MoE framework is attractive when considering prediction from multiple data sources with different error characteristics, under the assumption that different sequencing technologies may work well in different cases. Our actual implementation is slightly more nuanced, adding a third expert for cases where combining the data from the two technologies is useful, while retaining the individual experts for cases where learning is difficult using the combined data but simpler using one of the two types of data.

MoE architecture
Each Illumina and PacBio read is encoded into a multi-dimensional numerical array representing a sequence. The encoding of bases, quality scores, etc. into numerical values follows DeepVariant (Poplin, et al., 2018). When analyzing a site, these encoded read sequences are fed into the MoE, along with information on candidate alleles and their supporting reads.
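As a rough sketch of this kind of encoding (the exact channels follow DeepVariant and are not reproduced here), a read can be turned into a window-aligned sequence of vectors with one-hot base channels and a scaled quality channel; window positions the read does not cover are zero vectors. All parameter choices below are illustrative.

```python
def encode_read(bases, quals, read_start, window_start, window_size):
    """Encode the portion of a read inside a window as a sequence of vectors:
    4 one-hot base channels plus one scaled quality channel. Window positions
    not covered by the read remain zero vectors (padding)."""
    base_index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = [[0.0] * 5 for _ in range(window_size)]
    for i, (base, qual) in enumerate(zip(bases, quals)):
        pos = read_start + i - window_start
        if 0 <= pos < window_size and base in base_index:
            vec = encoded[pos]
            vec[base_index[base]] = 1.0
            vec[4] = min(qual, 40) / 40.0  # scale quality into [0, 1]
    return encoded

# A 4-base read starting 2 positions into an 8-position window.
enc = encode_read("ACGT", [30, 30, 10, 40], read_start=102,
                  window_start=100, window_size=8)
print(enc[2])  # 'A' with Q30 -> [1.0, 0.0, 0.0, 0.0, 0.75]
```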
The block diagram of the MoE is shown in Figure 3. The MoE uses one-dimensional (1D) Convolutional Neural Networks (CNNs) with residual connections (He, Zhang, Ren, & Sun, 2015). The MoE has different stages for processing read-level, allele-level, and site-level information. The first stage comprises a pair of CNNs ("Read-level convolver" in Figure 3(a)-(b)) that process representations of either Illumina or PacBio reads. The outputs, which we refer to as read-level features, are sequences that emphasize the spatial information in each read relevant to variant calling. In the next stage, read-level features corresponding to a candidate allele are accumulated and passed through a second pair of CNNs ("Allele-level convolver" in Figure 3(a)-(b)) to obtain allele-level features summarizing the evidence in favor of each candidate allele from each sequencing technology. Accumulation of allele-level features provides site-level features that summarize the available data at a site and the reference context. Combined allele-level and site-level features are also produced using additional CNNs (Figure 3(c)), which integrate evidence from both sequencing platforms to provide a single representation of allele-level evidence and site-level evidence. In the final stage, allele- and site-level features are fed into four CNNs: three experts and the meta-expert (Figure 3(d)). The NGS and TGS experts use allele- and site-level features from the corresponding sequencing platforms, and the other two use the combined features. Internally, the experts compare the allele-level evidence for each allele to the site-level evidence to determine whether the allele is noisy or not. The meta-expert uses the combined site-level feature to determine which expert it prefers at the site. We combine the outputs of the experts according to the probability distribution produced by the meta-expert.
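The exact combination rule used by HELLO is not reproduced here; a standard soft mixture, in which each expert's per-allele probabilities are weighted by the meta-expert's distribution over experts, can be sketched as follows (all numbers are illustrative):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_of_experts(meta_logits, expert_allele_probs):
    """Standard MoE combination: weight each expert's per-allele probability
    by the meta-expert's probability for that expert, then sum."""
    weights = softmax(meta_logits)
    n_alleles = len(expert_allele_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, expert_allele_probs))
        for i in range(n_alleles)
    ]

# Three experts (NGS, TGS, hybrid), two candidate alleles; values are made up.
meta = [2.0, -1.0, 0.5]                    # meta-expert strongly prefers expert 0
experts = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.3]]
combined = mixture_of_experts(meta, experts)
print(combined)
```

In this sketch the final per-allele probability is a convex combination of the experts' probabilities, so a confident meta-expert effectively selects a single expert, while an uncertain one blends them.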
Reads are one-dimensional sequences. Hence, we do not use the 2D convolutions employed for pattern recognition in images, but 1D convolutions suited to pattern recognition in sequences, which require fewer computations per layer than 2D convolutions. Convolutional layers in DeepVariant perform convolution operations not only along each read, but also across reads. Since reads may be assumed to be independently sequenced, and since DNNs work well at capturing structure in data rather than random, independent events, we do not expect convolutions applied across reads to add useful operations, and we avoid them. Due to these design choices, our model is comparatively small (< 6.5 million parameters) compared to DeepVariant (~25 million parameters). In addition, we do not use the "null reads" that DeepVariant uses to pad input images to a fixed size, saving computations. DeepVariant performs predictions on allele pairs, which requires a number of evaluations that grows quadratically with the number of candidate alleles at a site, whereas our binary-relevance formulation scales linearly.

We compared our method to the state-of-the-art variant caller using Illumina sequencing data as input. The calls were evaluated using the hap.py evaluation tool (Krusche, et al., 2019).
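To illustrate why the per-read 1D convolutions discussed above avoid mixing information across independently sequenced reads, here is a minimal valid 1D convolution applied to each read separately (toy feature vectors and kernel, not HELLO's actual layers):

```python
def conv1d(seq, kernel):
    """Valid 1D convolution over a per-position feature sequence.
    seq: list of feature vectors; kernel: list (width) of channel weights."""
    width = len(kernel)
    out = []
    for start in range(len(seq) - width + 1):
        acc = 0.0
        for k in range(width):
            for c, w in enumerate(kernel[k]):
                acc += w * seq[start + k][c]
        out.append(acc)
    return out

# Each read is convolved independently: no kernel dimension spans across reads.
reads = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],  # read 1: 3 positions, 2 channels
    [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],  # read 2
]
kernel = [[1.0, 0.0], [0.0, 1.0]]          # width-2 kernel over 2 channels
features = [conv1d(r, kernel) for r in reads]
print(features)  # [[2.0, 0.0], [0.0, 2.0]]
```

Because the kernel slides only along each read, the number of operations is linear in total read length, with no cross-read dimension to convolve over.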
In this paper, we focus on the numbers of mistakes made in low-complexity regions of the genome, where, as noted earlier, most erroneous indel calls from short reads occur. PacBio reads are also susceptible to systematic indel errors in regions with repeat structures such as homopolymers, which are found in low-complexity regions. It is interesting to see whether combining the two types of sequencing data, both susceptible to indel errors in low-complexity regions but potentially due to different underlying causes and error models, allows us to improve accuracy in these challenging regions.
The analyses are summarized in two sets of plots in Figure 5. One set is based on the sdust algorithm, which measures the occurrence of short substrings over the alphabet {A, C, G, T} in genomic regions, allowing scores to be assigned to the relative complexity of a region. Based on this score and a threshold, low-complexity regions may be identified. We use sdust-based low-complexity regions that were published previously.
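As a rough illustration of this kind of complexity scoring (a simplified DUST-style triplet count, not the exact sdust implementation), repeated triplets inflate the score, so homopolymers and short tandem repeats score higher than diverse sequence:

```python
from collections import Counter

def dust_score(seq):
    """Simplified DUST-style low-complexity score: count each overlapping
    triplet in the window; repeated triplets inflate the score. A window of
    entirely distinct triplets scores 0; a homopolymer scores the maximum."""
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    counts = Counter(triplets)
    n = len(triplets)
    if n <= 1:
        return 0.0
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)

print(dust_score("ACGTACGTACGT"))  # repetitive but triplet-diverse: low score
print(dust_score("AAAAAAAAAAAA"))  # homopolymer: maximal score
```

A threshold on such a score, applied over sliding windows, yields the kind of low-complexity region masks referred to in the text.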
Regarding Figure 5(a)-(c), there are broadly two types of low-complexity regions, which are split into three categories for analysis. The first type is a set of low-complexity regions summarized in the Tandem Repeats Database, TRDB (Gelfand, Rodriguez, & Benson, 2006). This is a database of repeats in the human genome, out of which we examine repeats where the repetitive units have an identity of 95% or more (Figure 5(a)). Some exact tandem repeats are not included in this database. The second set adds a set of exact tandem repetitions to the TRDB repeats mentioned above. This set is further divided into two subsets: one where the repetitive region is small (Figure 5), where context-dependent errors limit the improvement in accuracy, and one where such context-dependent errors are not as pronounced and where the improvement is more significant.

We next look at the improvement in indel call accuracy with the size of the indel, presented in Figure 6. Three categories of indels are considered: indels of size 1-5bp, 6-15bp, and longer indels. The strongest improvement is observed for intermediate-sized indels (6-15bp). Illumina reads are likely to represent shorter indels more accurately than longer ones due to the relative size of short reads compared to indel size. Hence, as indel size increases, the signal strength for the indel drops when using Illumina reads.

Discussion
We devised a hybrid variant calling method, HELLO, that combines PacBio and Illumina reads, data from sequencing platforms with different error profiles but some common weaknesses. In cases where both types of sequencing data are available, our method improves the accuracy of small variant calls over that of the state-of-the-art variant caller, as seen in our experiments. In addition, we studied the performance of our method in calling small indels in contexts of low complexity in the reference, which have historically been difficult for small variant callers. Such contexts can cause problems for both Illumina reads and PacBio reads; almost all false positive indel calls from Illumina reads arise in these regions, and PacBio reads are susceptible to context-dependent indel errors in low-complexity regions.
Our analysis indicates that using the hybrid method helps improve the accuracy in these regions, rather than compound the problem that both types of sequencing data have in these regions.

Determining hotspot regions and candidates
We iterate through reads from an alignment file, counting the numbers of reference-matching and reference-deviating cigar operations at each location (reference-deviating cigars include substitutions and indels). If the number of reads supporting a reference-deviating cigar at a location is high enough to warrant a closer look, we mark the location as a hotspot. This is similar to the method used in DeepVariant.
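A minimal sketch of this hotspot-marking step follows; the thresholds are hypothetical placeholders, not the cutoffs used by HELLO:

```python
from collections import defaultdict

def find_hotspots(read_alignments, min_support=2, min_fraction=0.1):
    """Mark positions where enough reads carry a reference-deviating
    operation (substitution or indel) to warrant a closer look."""
    deviating = defaultdict(int)
    total = defaultdict(int)
    for read in read_alignments:
        for pos, op in read:          # op: 'M' (match) or 'X'/'I'/'D' (deviating)
            total[pos] += 1
            if op != 'M':
                deviating[pos] += 1
    return sorted(
        pos for pos in deviating
        if deviating[pos] >= min_support
        and deviating[pos] / total[pos] >= min_fraction
    )

# Three toy reads covering positions 100-102; two carry a substitution at 101.
reads = [
    [(100, 'M'), (101, 'X'), (102, 'M')],
    [(100, 'M'), (101, 'X'), (102, 'I')],
    [(100, 'M'), (101, 'M'), (102, 'M')],
]
print(find_hotspots(reads))  # [101]: 2 of 3 reads deviate there
```

Requiring both a minimum read count and a minimum fraction filters out isolated sequencing errors while keeping candidate variant sites.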
During variant calling, reads aligning to a window surrounding a contiguous set of hotspots are extracted, and the hotspot analysis is performed again in the region with slightly higher sensitivity. Adjacent hotspots triggered by indels are coalesced into a single site of interest; this minimizes the amount of conflict in the variants reported. The remaining hotspots are individually treated as sites of interest. The sites of interest are analyzed to determine candidate alleles and supporting reads, which are fed into the Mixture of Experts (MoE) in HELLO. A read supports a candidate allele if it contains the candidate allele aligned to the site of interest. If a read partially overlaps a site of interest, we check whether the partial overlap may be resolved by unambiguously matching it to one of the candidate alleles at the site. If so, the read is determined to support that candidate; if not, the read is discarded.
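The coalescing of adjacent indel-triggered hotspots into single sites of interest might look like the following sketch; the merging gap is a hypothetical parameter:

```python
def coalesce_hotspots(hotspots, hotspot_is_indel, max_gap=1):
    """Merge adjacent hotspots triggered by indels into single sites of
    interest; substitution-only hotspots remain individual sites.
    Returns a list of (start, end) sites."""
    sites = []
    current = None                            # open indel site being extended
    for pos in sorted(hotspots):
        if hotspot_is_indel.get(pos, False):
            if current is not None and pos - current[1] <= max_gap:
                current = (current[0], pos)   # extend the open indel site
            else:
                if current is not None:
                    sites.append(current)
                current = (pos, pos)
        else:
            if current is not None:
                sites.append(current)
                current = None
            sites.append((pos, pos))          # substitution: its own site
    if current is not None:
        sites.append(current)
    return sites

# Three adjacent indel hotspots collapse; the lone substitution stays separate.
hotspots = [100, 101, 102, 110]
is_indel = {100: True, 101: True, 102: True, 110: False}
print(coalesce_hotspots(hotspots, is_indel))  # [(100, 102), (110, 110)]
```

Merging neighboring indel hotspots means the candidate alleles at one site of interest describe one consistent local haplotype rather than several overlapping, conflicting variants.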
Details of input data representation and read filtration

Figure 1. (a) Alignment data at a site of interest, and conversion of reads into representations for the DNN. The portion of a read within a window around a site of interest is extracted, and the bases, quality scores, etc. are converted to a sequence of vectors. Zero vectors are padded to this sequence corresponding to parts of the window that are unoccupied by the read, if the read does not fully occupy the window. (b) How input read representations are converted to different features for a single sequencing platform (NGS or TGS). The input representations pass through multiple stages, giving different features summarizing read-level, allele-level, and site-level evidence.