bioRxiv
NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

Umair Ahsan, Qian Liu, Li Fang, Kai Wang
doi: https://doi.org/10.1101/2019.12.29.890418
1 Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
2 Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
Correspondence: wangk@email.chop.edu

Abstract

Variant (SNP/indel) detection from high-throughput sequencing data remains an important yet unresolved problem. Long-read sequencing enables variant detection in difficult-to-map genomic regions that short-read sequencing cannot reliably examine (for example, only ~80% of genomic regions are marked as “high-confidence regions” with SNP/indel calls in the Genome In A Bottle project); however, the high per-base error rate poses unique challenges for variant detection. Existing methods for long-read data typically rely on analyzing pileup information from neighboring bases surrounding a candidate variant, similar to short-read variant callers, so the benefits of much longer read length are not fully exploited. Here we present a deep neural network called NanoCaller, which detects SNPs by examining pileup information solely from other, nonadjacent candidate SNPs that share the same long reads, using long-range haplotype information. Using the SNPs called by NanoCaller, it phases long reads and performs local realignment on the two sets of phased reads to call indels with another deep neural network. Extensive evaluation on five human genomes (sequenced by Nanopore and PacBio long-read techniques) demonstrated that NanoCaller greatly improves performance in difficult-to-map regions compared to other long-read variant callers. We experimentally validated 41 novel variants in difficult-to-map regions of a widely used benchmarking genome, which could not be reliably detected previously. We extensively evaluated the run-time characteristics of NanoCaller and the sensitivity of its parameter settings to different characteristics of sequencing data. Finally, we achieved the best performance in Nanopore-based variant calling in MHC regions in the PrecisionFDA Variant Calling Challenge on Difficult-to-Map Regions by ensemble calling.
In summary, by incorporating haplotype information in deep neural networks, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing data.

Introduction

Single-nucleotide polymorphisms (SNPs) and small insertions/deletions (indels) are two common types of genetic variants in human genomes. They contribute to genetic diversity and critically influence phenotypic differences, including susceptibility to human diseases. The detection (i.e. “calling”) of SNPs and indels is thus a fundamentally important problem in using the new generations of high-throughput sequencing data to study genome variation and genome function. A number of methods have been designed to call SNPs and small indels from Illumina short-read sequencing data. Short reads are usually 100-150 bp long and have a per-base error rate of less than 1%. Variant calling methods for short reads, such as GATK [1] and FreeBayes [2], achieve excellent performance in detecting SNPs and small indels in genomic regions marked as “high-confidence regions” in various benchmarking tests [3–5]. However, since these methods were developed for short-read sequencing data with low per-base error rates and low insertion/deletion errors, they do not work well on long-read sequencing data with high error rates. Additionally, due to inherent technical limitations of short-read sequencing, the data cannot be used to call SNPs and indels in complex or repetitive genomic regions; for example, only ~80% of genomic regions are marked as “high-confidence regions” with reliable SNP/indel calls in the Genome In A Bottle (GIAB) project, suggesting that ~20% of the human genome is inaccessible to conventional short-read sequencing technologies for reliable variant discovery.

Oxford Nanopore [6] and Pacific Biosciences (PacBio) [7] technologies are two leading long-read sequencing platforms, which have developed rapidly in recent years, with steadily decreasing costs and improving read lengths in comparison to Illumina short-read sequencing technologies. Long-read sequencing techniques can overcome several challenges that cannot be solved using short-read sequencing, such as calling long-range haplotypes, identifying variants in complex genomic regions, identifying variants in coding regions of genes with many pseudogenes, sequencing across repetitive regions, phasing distant alleles and distinguishing highly homologous regions [8]. To date, long-read sequencing techniques have been successfully used to sequence the genomes of many species, resolving challenging biological problems such as de novo genome assembly [9–13] and SV detection [14–19]. However, the per-base accuracy of long reads is much lower than that of short reads, with raw base-calling error rates of 3-15% [20]. This high error rate challenges widely used variant calling methods (such as GATK [1] and FreeBayes [2]), which were designed for Illumina short reads and cannot handle reads with higher error rates. It is also worth noting that HiFi reads produced by circular consensus sequencing on the PacBio long-read platform [21], or by similar methods on the Nanopore platform, can potentially improve the detection of SNPs/indels by adapting existing short-read variant callers, due to their much lower per-base error rates. However, HiFi reads substantially increase sequencing cost for the same base output, so they may be more suitable for specific application scenarios such as capture-based sequencing or amplicon sequencing. As more and more long-read sequencing data become available, there is an urgent need to detect SNPs and small indels that takes full advantage of long-read data.

Several recent works aimed to design accurate SNP/indel callers for long-read sequencing data using machine learning, especially deep learning-based algorithms. DeepVariant [22] was among the first successful endeavors to develop a deep learning variant caller for SNPs and indels across different sequencing platforms (i.e. the Illumina, PacBio and Nanopore platforms). In DeepVariant, local regions of reads aligned against a variant candidate site are transformed into an image representation, and a deep learning framework is trained to distinguish true variants from false variants generated by noisy base calls. DeepVariant achieved excellent performance on short reads, as previous variant calling methods did. Later on, Clairvoyante [23] and its successor Clair [24] implemented deep learning variant calling methods in which summaries of adjacently aligned local genomic positions of putative candidate sites are used as input to the deep learning framework. These three deep learning-based methods work well on both short-read and long-read data, but they do not incorporate haplotype structure in variant calling; they consider each SNP separately, while a recent test [21] with DeepVariant has shown that a phased BAM with haplotype-sorted reads can improve variant calling accuracy, because grouping reads from the same haplotype helps the neural network learn from the read-pileup image. However, this approach underutilizes the rich haplotype information in long reads, even when it is explicitly provided in a phased BAM file as input. Moreover, enough variants need to be known beforehand to phase a BAM file. Two recent works have endeavored to improve variant calling by using phasing information from long-read sequencing data.
Longshot [25] uses a pair-Hidden Markov Model (pair-HMM) over a small local window around candidate sites to call SNPs from long-read data, and then improves genotyping of called SNPs using HapCUT2 [26] based on the most likely pair of haplotypes given the current variant genotypes. However, Longshot cannot identify indels. The Oxford Nanopore Technologies company also recently released a SNP/indel caller, Medaka [27], using deep learning on long-read data. Although not published, based on its GitHub repository, Medaka first predicts SNPs from unphased long reads, then uses WhatsHap to phase reads, and finally calls SNPs and indels for each group of phased reads. In both methods, mutual information from long-range haplotype SNPs is ignored. In summary, although several methods for variant detection on long-read sequencing data have become available, there is room to further improve these approaches, especially for difficult-to-map regions. We believe that improved SNP/indel detection on long-read data will enable widespread research and clinical applications of long-read sequencing techniques.

In this study, we propose a deep learning framework, NanoCaller, which integrates long-range haplotype structure in a deep convolutional neural network to improve variant detection on long-read sequencing data. It uses only haplotype information in SNP calling (without requiring a phased BAM alignment as input) and generates input features for a SNP candidate site from long-range heterozygous SNP sites, which are fed into a deep convolutional neural network for SNP calling. Note that these long-range heterozygous SNP sites can be hundreds to tens of thousands of bases away, and NanoCaller does not use local neighboring bases; this is substantially different from DeepVariant, Clairvoyante [23] and its successor Clair [24], as well as Longshot and Medaka, where local neighboring bases of SNP sites are used. After that, NanoCaller uses the predicted SNP calls to phase aligned reads with WhatsHap for indel calling. Local multiple sequence alignment of phased reads around indel candidate sites is used to generate consensus sequences and feature inputs for a deep convolutional neural network that predicts indel variant zygosity. We assess NanoCaller on several human genomes, HG001 (NA12878), HG002 (NA24385), HG003 (NA24149), HG004 (NA24143) and HX1, with both Nanopore and PacBio long-read data. In particular, for the Ashkenazim trio (HG002, HG003 and HG004), we evaluate NanoCaller in difficult-to-map genomic regions to investigate the unique advantages provided by long reads. Our evaluation demonstrates competitive performance of NanoCaller against existing tools, with particularly improved performance in complex genomic regions that cannot be reliably called from short-read data. NanoCaller is publicly available at https://github.com/WGLab/NanoCaller.

Results

Overview of NanoCaller

NanoCaller (Figure 1) takes long-read alignments against a reference genome as input and generates a VCF file of called SNPs and indels. For SNP calling, NanoCaller selects candidate SNP sites according to specified thresholds for minimum coverage and minimum alternative allele frequency (a fraction of candidates are likely to be false positives given the deliberately relaxed thresholds for candidate identification). Long-range haplotype features for the candidate sites (Figure 2) are generated and fed to a deep convolutional network to distinguish true variants from false candidate sites. The predicted SNPs are phased and then used, together with the long-read alignments, to identify indels. Indel candidate sites are selected according to specified minimum coverage and insertion/deletion frequency thresholds applied to each phased read set. Input features for indel candidate sites are generated using multiple sequence alignment on the entire read set and on each set of phased reads (Figure 3). After that, another deep convolutional neural network determines indel calls, and the allele sequence for each indel is predicted by comparing consensus sequences against the reference sequence.
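The candidate-selection step described above can be sketched as follows. This is an illustrative sketch, not NanoCaller's implementation: the function name, input layout, and the minimum-coverage default are assumptions; the 0.15 alternative-allele frequency matches the threshold reported later in the paper.

```python
def select_snp_candidates(pileup_counts, min_coverage=16, min_alt_freq=0.15):
    """Select candidate SNP sites from per-site base counts.

    pileup_counts: dict mapping position -> dict of base -> read count,
    with the reference base stored under the key 'ref' (hypothetical layout).
    Returns positions whose coverage and alternative-allele frequency pass
    the (deliberately relaxed) thresholds.
    """
    candidates = []
    for pos, counts in pileup_counts.items():
        ref_base = counts['ref']
        base_counts = {b: n for b, n in counts.items() if b != 'ref'}
        coverage = sum(base_counts.values())
        if coverage < min_coverage:
            continue  # too few reads to judge the site
        alt_count = sum(n for b, n in base_counts.items() if b != ref_base)
        if alt_count / coverage >= min_alt_freq:
            candidates.append(pos)
    return candidates
```

With relaxed thresholds like these, many selected sites are sequencing errors rather than variants; the downstream neural network is what separates the two.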

Figure 1. The deep-learning framework for SNP and indel calling.

a) An illustration of the convolutional neural network model for SNP calling; b) an illustration of the convolutional neural network model for indel calling. In both models, the first convolutional layer uses three kernels of sizes 1×5, 5×1 and 5×5, whereas the second and third convolutional layers use kernels of size 2×3. The output of the third convolutional layer is flattened and fed to a fully connected layer, followed by a hidden layer with 48 nodes and a 50% dropout rate. In a), the output of the hidden layer is split into two independent pathways: one for calculating the probability of each base and the other for calculating zygosity probabilities. The zygosity probability is only used in the training process. In b), the output of the first fully connected layer is fed into two fully connected hidden layers to produce the probabilities of the four possible zygosity cases.

Figure 2. An example on how to construct image pileup for a SNP candidate site.

a) reference sequence and reads pileup at site b and sites in set Z; b) raw counts of bases at sites in Z in each read group split by the nucleotide types at site b; c) frequencies of bases at sites in Z with negative signs for reference bases; d) flattened pileup image with fifth channel after reference sequence row is added; e) pileup image used as input for NanoCaller.
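The transformation in panel c (per-site base frequencies, with the reference base's frequency negated) can be sketched for a single nearby heterozygous site as below; the function name and the fixed ACGT ordering are illustrative assumptions, not the actual feature code:

```python
import numpy as np

def site_feature_column(base_counts, ref_base, bases="ACGT"):
    """One feature column for a site in set Z: frequency of each base,
    with the reference base's frequency given a negative sign (Figure 2c)."""
    total = sum(base_counts.get(b, 0) for b in bases)
    col = np.zeros(len(bases))
    if total == 0:
        return col  # no coverage at this site
    for i, b in enumerate(bases):
        freq = base_counts.get(b, 0) / total
        col[i] = -freq if b == ref_base else freq
    return col
```

Stacking such columns for all sites in Z, within each read group split by the base at site b, yields the pileup image of panels d and e.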

Figure 3. An example on how to construct image pileup for an indel site.

a) reference sequence and reads pileup at the candidate site before and after multiple sequence alignment, and the consensus sequence after realignment; b) reference sequence and consensus sequence at the candidate site before and after pairwise alignment, and the inferred sequence; c) shows raw count of each symbol at each column of multiple sequence alignment pileup; d) matrix M, showing frequency of each symbol at each column of multiple sequence alignment pileup; e) first channel of input image, matrix M minus Q (one-hot encoding of realigned reference sequence); f) matrix Q, one-hot encoding of realigned reference sequence which forms the second channel of input image.
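The construction of matrices M and Q described in panels c-f can be sketched as follows; the function name, input layout (one string of symbols per MSA column), and the ACGT- alphabet are illustrative assumptions under the figure's description, not the actual implementation:

```python
import numpy as np

def indel_channels(msa_columns, realigned_ref, symbols="ACGT-"):
    """Build the two input channels for the indel model:
    channel 1 = M - Q, channel 2 = Q, where M holds per-column symbol
    frequencies of the read MSA and Q one-hot encodes the realigned reference."""
    length = len(realigned_ref)
    M = np.zeros((len(symbols), length))
    Q = np.zeros((len(symbols), length))
    for j, column in enumerate(msa_columns):
        total = len(column)
        for s in column:  # accumulate symbol frequencies for this column
            M[symbols.index(s), j] += 1 / total
        Q[symbols.index(realigned_ref[j]), j] = 1.0
    return M - Q, Q
```

Subtracting Q from M makes columns that agree with the reference close to zero, so the network sees mainly the positions where reads deviate from the realigned reference.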

The performance of NanoCaller is evaluated on both Oxford Nanopore and PacBio reads, and compared with the performance of Medaka (v0.10.0), Clair (v2.0.1), Longshot (v0.4.1), DeepVariant (v1.0.0) and WhatsHap (v1.0) with their default parameters for each type of sequencing technology. Unless stated otherwise, evaluation is on the GRCh38 reference genome for benchmark variants in high-confidence regions. RTG tools (the vcfeval submodule) [28] is used to calculate evaluation measures such as precision, recall and F1 score. For whole-genome analysis, we show each variant caller’s performance at its recommended quality threshold where available (NanoCaller, Clair, Longshot and DeepVariant); for Medaka, we calculate the quality score thresholds that give the highest F1 score for each genome using vcfeval, and use their average as the final quality score cut-off to report results. Further discussion of quality thresholds for NanoCaller and other variant callers is provided in Supplementary Tables S13–S16. In particular, variant calling performance analysis in difficult-to-map genomic regions requires different quality score cut-offs from whole-genome analysis due to the distinct error profiles of these regions; therefore, we use the average best quality score cut-off (determined in the same manner as for Medaka) for each variant caller in each type of difficult genomic region. Below, we present the performance of three NanoCaller SNP models: NanoCaller1 (trained on HG001 ONT reads), NanoCaller2 (trained on HG002 ONT reads) and NanoCaller3 (trained on an HG003 PacBio CLR dataset), and two NanoCaller indel models: an ONT indel model (trained on HG001 ONT reads) and a PacBio indel model (trained on HG001 PacBio CCS reads). All NanoCaller models were trained using v3.3.2 of the GIAB benchmark variant calls with reads aligned to the GRCh38 reference genome.
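The averaged best-cutoff procedure used for Medaka (and for the difficult-region evaluations) can be sketched as below; the function names and example thresholds are hypothetical, and the per-genome F1-by-threshold curves would come from repeated vcfeval runs:

```python
def best_quality_cutoff(f1_by_threshold):
    """Return the quality threshold with the highest F1 score for one genome.

    f1_by_threshold: dict mapping quality threshold -> F1 score,
    e.g. as produced by evaluating calls at several cut-offs with vcfeval.
    """
    return max(f1_by_threshold, key=f1_by_threshold.get)

def average_best_cutoff(per_genome_scores):
    """Average the per-genome best thresholds into a single reporting cut-off."""
    best = [best_quality_cutoff(scores) for scores in per_genome_scores]
    return sum(best) / len(best)
```

Averaging across genomes avoids tuning the cut-off to any single test genome, which would overstate real-world performance.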

Evaluation of NanoCaller on Oxford Nanopore Sequencing

Performance on SNP calling

We evaluated NanoCaller on Oxford Nanopore sequencing reads against several existing tools. The settings of the compared SNP callers are as follows: NanoCaller was trained on v3.3.2 GIAB benchmark variants, and for performance evaluation we used an alternative allele frequency threshold of 0.15 for SNP candidates; the Clair 1_124x ONT model was trained on 124x coverage HG001 ONT reads using v3.3.2 GIAB benchmark variants; the Longshot pair-HMM model is not trained on any genome, as it estimates its parameters during each run; and Medaka’s ‘r941_min_diploid_snp_model’ model, which is trained on several bacterial and eukaryotic read datasets and variant call sets, was used for testing. We compared the performance of these methods on five genomes, HG001, HG002, HG003, HG004 and HX1, under two testing strategies: cross-genome testing and cross-reference testing.

Cross-genome testing is critical to demonstrate the performance of a variant caller in a real-world scenario: the machine learning model is trained on one set of genomes and tested on others. Under this testing strategy, the SNP calling performance of NanoCaller, together with four other variant callers (Medaka, Clair, Longshot and WhatsHap), is shown in Table 2 and Figure 4 (a), (b) and (c) on Nanopore reads of the five genomes. We evaluated these performances using the latest available GIAB benchmark variants for each genome, i.e. v3.3.2 for HG001 and v4.2 for the Ashkenazim trio.

Figure 4. Performance of NanoCaller and state-of-the-art variant callers on five whole-genome Oxford Nanopore sequencing data sets.

The performance of SNP predictions on ONT reads: a) precision, b) recall, c) F1 score. d) The performance of SNP predictions on HG002 (ONT), HG003 (ONT) and HG004 (ONT) in difficult-to-map genomic regions. The performance of variant predictions on HG002 (ONT), HG003 (ONT) and HG004 (ONT) in Major Histocompatibility Complex regions: e) SNPs, f) overall variants. The performance of indel predictions on ONT reads in non-homopolymer regions: g) precision, h) recall, i) F1 score. For HX1, variants called from high-coverage short-read data are used as the benchmark, with the complement of the difficult-to-map genomic regions used in d) serving as high-confidence regions. Benchmark variants v3.3.2 for HG001 and v4.2 for the Ashkenazim trio (HG002, HG003, HG004) are used for evaluation.

On the Ashkenazim trio (HG002, HG003, HG004), NanoCaller has better performance than Clair and Longshot in terms of both recall and F1 score (F1 scores are NanoCaller1: 97.99%, 98.88%, 98.77%; NanoCaller2: 97.97%, 98.88%, 98.82% vs Clair: 97.76%, 98.49%, 98.55% and Longshot: 98.00%, 97.87%, 97.84% on HG002, HG003 and HG004 respectively), with even higher margins for recall (recalls are NanoCaller1: 97.99%, 98.73%, 98.69%; NanoCaller2: 98.03%, 98.61%, 98.61% vs Clair: 97.28%, 97.94%, 98.11% and Longshot: 97.09%, 96.83%, 96.75% on HG002, HG003 and HG004 respectively), whereas Medaka performs better than NanoCaller in this regard. All methods show similar precision on the trio, except that Longshot has much higher precision on HG002 than the other methods. We also evaluated SNP calls from the four methods on the older benchmark v3.3.2 and on reads basecalled with older Nanopore basecallers for the Ashkenazim trio HG002-4, as shown in Supplementary Tables S2 and S3. We find that all methods show improved performance on the newer data; this suggests that improved Nanopore basecalling and a growing number of benchmark variants enhance the evaluation of variant calling: as shown in Table 1, each genome of the Ashkenazim trio HG002-4 has a larger variant call set than HG001 (370-400k more SNPs per genome) and a larger high-confidence region that includes more difficult genomic regions (at least 200 Mbp larger than HG001, covering an extra 7% of the reference genome).

Table 1.

Statistics of benchmark variants in chromosomes 1-22 of each genome aligned to the GRCh38 reference genome. For the four genomes with GIAB benchmark variant calls (v3.3.2 for HG001, and v4.2 for HG002, HG003 and HG004), statistics within the high-confidence regions are also given. For HX1, high-confidence regions are created by removing GIAB low-complexity regions from the GRCh38 reference genome.

We also show that the performance of NanoCaller SNP models is independent of the reference genome used. Under this cross-reference testing, we evaluated the NanoCaller1 model (trained on reads aligned against GRCh38) on HG002 ONT reads aligned to both GRCh38 and GRCh37, and calculated the performance using v3.3.2 benchmark variants for HG002, the latest version for which benchmark variant calls are available on GRCh37. For reads aligned to GRCh38, we obtained a precision of 97.49%, a recall of 97.97% and an F1 score of 97.73%, whereas for reads aligned to GRCh37, we obtained a precision of 97.88%, a recall of 98.02% and an F1 score of 97.95%. The comparable performance on GRCh38 and GRCh37 indicates that NanoCaller can be used on alignments generated by mapping to different reference genomes.

Performance on SNP calling in difficult-to-map genomic regions

We further demonstrate that NanoCaller has a unique advantage in calling SNPs in difficult-to-map genomic regions. For an unbiased test, we evaluated the NanoCaller1 SNP model (trained on HG001 ONT reads with v3.3.2 benchmark variants) on ONT reads of the three genomes of the Ashkenazim trio, together with other variant callers, using v4.2 of the GIAB benchmark variants for the trio, which provide a more exhaustive list of true variants and high-confidence intervals in difficult-to-map genomic regions. Difficult-to-map genomic regions here are those defined by the GA4GH Benchmarking Team and the Genome in a Bottle Consortium, downloaded as BED files from the GIAB v2.0 genome stratification.

These regions contain all tandem repeats, all homopolymers >6bp, all imperfect homopolymers >10bp, all low-mappability regions, all segmental duplications, regions with GC content <25% or >65%, bad promoters, and other difficult regions such as the Major Histocompatibility Complex. We intersected the BED files with the high-confidence intervals for each genome in the trio and evaluated SNP performance in the intersected regions. As shown in Table 2, each genome has at least 600k SNPs in the intersection of difficult-to-map regions and high-confidence intervals, a significant fraction (18-19%) of all SNPs in the high-confidence regions. The evaluation on these SNPs is shown in Figure 4 (d) and Table 2: NanoCaller1 (F1 scores 95.60%, 96.78% and 96.66% for HG002, HG003 and HG004 respectively) performs better than all other variant callers on each genome. F1 scores of NanoCaller exceed Medaka’s, Clair’s and Longshot’s F1 scores by 0.45%, 0.51% and 1.98% on average. In Supplementary Tables S6–S11, we further show a detailed breakdown of performance in the difficult-to-map regions and demonstrate that NanoCaller performs better than other variant callers for SNPs in each of the following difficult-to-map regions: segmental duplications (Supplementary Table S9), tandem and homopolymer repeats (Supplementary Table S10), low-mappability regions (Supplementary Table S8), and the Major Histocompatibility Complex (Supplementary Table S11).
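The intersection of stratification BED files with high-confidence intervals is a standard interval intersection; a generic single-chromosome sketch (0-based, half-open intervals, as in BED; not NanoCaller's code, which would typically call `bedtools intersect`) is:

```python
def intersect_beds(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) intervals
    on one chromosome, returning the overlapping sub-intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))  # the two intervals overlap here
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

Restricting evaluation to these intersected regions ensures that only variants with trustworthy ground truth inside the difficult regions are counted.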

Table 2.

Performance (precision/recall/F1 percentages) of SNP and indel predictions by NanoCaller1, NanoCaller2 and NanoCaller3 on ONT and PacBio (CCS and CLR) data and in difficult-to-map genomic regions, along with the performance of existing variant callers. These evaluations are based on v3.3.2 benchmark variants for HG001 and v4.2 benchmark variants for the Ashkenazim trio (HG002, HG003, HG004).

The NanoCaller team participated in the PrecisionFDA Truth Challenge V2 for difficult-to-map genomic regions (held in July 2020, see https://precision.fda.gov/challenges/10), and submitted variant calls for the Ashkenazim trio made by an ensemble of the NanoCaller, Medaka and Clair models described above. The challenge evaluated variant calls against GIAB v4.1 benchmark variants of HG003 and HG004, which were made public after the challenge ended; at the conclusion of the challenge, GIAB released v4.2 benchmark variants. Our ensemble submission won the award for best performance in the Major Histocompatibility Complex (MHC) using Nanopore reads [29]. Figure 4 (e) and (f) show the F1 scores for SNPs and for overall variants of the ensemble, NanoCaller, Medaka and Clair. While the ensemble performs better than all individual variant callers in general, NanoCaller’s performance on HG002 and HG004 is very close to the ensemble and significantly better than that of Medaka and Clair (F1 scores NanoCaller: 98.53%, 99.07% vs ensemble: 98.97%, 99.17%; Medaka: 97.15%, 94.29% and Clair: 97.55%, 98.59% for HG002 and HG004 respectively). NanoCaller always outperforms Longshot for SNP calling in MHC regions. Therefore, this is an independent assessment of the real-world performance of NanoCaller in detecting variants in complex genomic regions.

Performance on Indel Calling

We trained the NanoCaller ONT indel model on chromosomes 1-22 of HG001 (ONT) with v3.3.2 benchmark variants, and tested it on ONT reads of four genomes (HG001, HG002, HG003, HG004) against v3.3.2 benchmark variants for HG001 and v4.2 benchmark variants for the Ashkenazim trio. The settings of this evaluation are as follows: a haplotype insertion allele frequency threshold of 0.4 and a deletion frequency threshold of 0.6 are used to determine indel candidates, owing to the abundance of deletion errors in Nanopore reads. For each genome, we create evaluation regions by removing low-complexity homopolymer repeat regions from the GIAB high-confidence regions, and evaluate indel performance with RTG vcfeval. These low-complexity regions consist of perfect homopolymer regions of length ≥4bp as well as imperfect homopolymer regions of length >10bp, as provided in GIAB’s v2.0 genome stratification. The results are shown in Table 2 and Figure 4 (g), (h) and (i) for NanoCaller together with Medaka and Clair. According to the F1 scores in Table 2, NanoCaller generally performs better than Clair (~10% on HG002, ~7% on HG003 and ~6% on HG004) and competitively against Medaka on HG002 to HG004. It is also worth noting that NanoCaller has a higher recall than Medaka and especially Clair: for example, NanoCaller has ~18% and ~1.8% higher recall than Clair and Medaka respectively on each of the three genomes of the trio. This suggests that NanoCaller complements existing methods well by identifying additional indels. Supplementary Figure S1 further shows the concordance of ground truth variants in high-confidence regions (including homopolymer repeat regions) of the Ashkenazim trio correctly predicted by NanoCaller, Medaka and Clair. Figure S1 (b) shows that each tool makes a significant number (ranging from 19k to 60k) of correctly predicted indel calls that are not correctly predicted by the other variant callers.
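The asymmetric candidate thresholds above (0.4 for insertions, 0.6 for deletions, applied per phased read set) can be sketched as follows; the function name and count-based inputs are illustrative assumptions:

```python
def is_indel_candidate(n_reads, n_insertion, n_deletion,
                       ins_threshold=0.4, del_threshold=0.6):
    """Flag an indel candidate site within one phased read set.

    The deletion threshold is set higher than the insertion threshold
    because Nanopore reads contain abundant spurious deletions, so a
    larger fraction of deletion-supporting reads is required.
    """
    if n_reads == 0:
        return False
    return (n_insertion / n_reads >= ins_threshold or
            n_deletion / n_reads >= del_threshold)
```

For example, 9 insertion-supporting reads out of 20 (45%) qualify, while 10 deletion-supporting reads out of 20 (50%) do not, reflecting the lower trust placed in deletion signals.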

Performance on PacBio Sequencing data

On PacBio datasets of four genomes (HG001, HG002, HG003 and HG004), we evaluated NanoCaller SNP models NanoCaller1 and NanoCaller3 against v3.3.2 benchmark variants for HG001 and v4.2 benchmark variants for the trio. The settings of the compared tools are as follows: in NanoCaller, a minimum alternative allele frequency threshold of 0.15 was used to identify SNP candidates for both PacBio CCS and CLR reads, and the NanoCaller models were trained with v3.3.2 benchmark variants. For Clair, the PacBio model trained on HG001 and HG005 was used for testing CCS reads, whereas the model trained on the seven genomes HG001-HG007 was used for testing CLR reads; both Clair models used v3.3.2 benchmark variants for training. The PacBio model provided with the DeepVariant v1.0.0 release was used for testing; this model was trained on CCS reads of HG001-HG006 with v3.3.2 benchmark variants for HG001 and HG005-6, and v4.2 for HG002-HG004.

The results for SNP performance on CCS reads are shown in Table 2 and Figure 5 (a), (b) and (c), along with those of Clair, Longshot, DeepVariant and WhatsHap. Table 2 and Figure 5 (a), (b) and (c) show that 1) NanoCaller1 and NanoCaller3 have similar performance according to F1 scores, even though the NanoCaller1 model was trained on a Nanopore dataset while the NanoCaller3 model was trained on a PacBio CLR dataset; thus, NanoCaller performs well across different sequencing platforms; 2) NanoCaller outperforms Longshot; and 3) NanoCaller shows competitive performance against Clair (99.77%, 99.76%, 99.74% on the trio) and WhatsHap (99.12%, 99.66%, 99.66% on the trio), with differences < 0.3%.

Figure 5. Performance of NanoCaller and state-of-the-art variant callers on PacBio sequencing data sets.

The performance of SNP predictions on PacBio CCS reads: a) precision, b) recall, c) F1 score. The performance of indel predictions on PacBio CCS reads: d) precision, e) recall, f) F1 score. The performance of SNP predictions on PacBio CLR reads: g) precision, h) recall, i) F1 score. Benchmark variants v3.3.2 for HG001 and v4.2 for the Ashkenazim trio (HG002, HG003, HG004) are used for evaluation.

We also evaluated the NanoCaller PacBio indel model on CCS datasets of the Ashkenazim trio (HG002, HG003 and HG004); the results are shown in Table 2 and Figure 5 (d), (e) and (f), along with the Clair and DeepVariant indel performance. The F1 scores on the trio suggest that NanoCaller performs competitively against Clair. As expected, DeepVariant performs very well on CCS reads because CCS reads have much lower error rates.

With Continuous Long Read (CLR) datasets, we evaluated NanoCaller SNP models NanoCaller1 and NanoCaller3 on HG001 (reads aligned to GRCh37) and the Ashkenazim trio HG002, HG003 and HG004 (as shown in Table 2 and Figure 5 (g), (h) and (i)). Due to drastic differences in the coverage of the CLR datasets, we used a higher NanoCaller quality score cut-off for HG003 and HG004 than for HG001 and HG002. NanoCaller shows similar performance to other callers on the Ashkenazim trio in terms of F1 score (NanoCaller1: 98.26%, 94.33%, 93.47% vs Clair: 98.38%, 94.89%, 94.15% and Longshot: 98.41%, 94.35%, 93.27%), but performs worse on HG001.

Novel Variants called by NanoCaller

We also analyzed SNP calls made by NanoCaller on HG002 (ONT reads) that are absent in the GIAB ground truth calls (version 3.3.2) [30], and validated 17 regions of those SNP calls by Sanger sequencing before the v4 benchmark for HG002 became available (Sanger sequencing signals, along with inferred sequences of both chromosomes for each region, are given in the supplementary folder "Sanger Sequencing Files" and Sanger_sequences.xlsx). By deciphering the Sanger sequencing results, we identified 41 novel variants (25 SNPs, 10 insertions and 6 deletions), as shown in Table 4. Based on these 41 novel variants, we evaluated variant calling by different methods on both the older and the newly released ONT HG002 reads (as described in the Methods section) to see how more accurate long reads improve variant calling. We find that (1) on the newly released ONT HG002 reads, Medaka correctly identified 15 SNPs, 6 insertions and 2 deletions, Clair identified 14 SNPs, 6 insertions and 2 deletions, and Longshot correctly identified 18 SNPs, while NanoCaller correctly identified 20 SNPs, 6 insertions and 2 deletions, as shown in Supplementary Table S12; one of these 2 deletions was not called correctly by any other variant caller; and (2) on the older ONT HG002 reads, as shown in Table 4, Medaka correctly identified 8 SNPs, 3 insertions and 1 deletion, Clair identified 8 SNPs, 2 insertions and 1 deletion, and Longshot correctly identified 8 SNPs; in contrast, NanoCaller correctly identified 18 SNPs and 2 insertions, of which 10 SNPs and 1 insertion were not called correctly by any other variant caller on the older HG002 ONT reads. This indicates that improvements in per-base accuracy during basecalling significantly enhance variant calling performance. Table 4 also contains 2 multiallelic SNPs that were identified by NanoCaller but not correctly called by any of the other 3 methods.
One of the multiallelic SNPs, at chr3:5336450 (A>T,C), is shown in Figure 6, where both the IGV plots and the Sanger results clearly show a multiallelic SNP that was correctly identified by NanoCaller but missed by the other variant callers, likely due to NanoCaller's unique haplotype-aware features. In summary, the predictions on these novel variants clearly demonstrate the usefulness of NanoCaller for SNP calling.

Figure 6. Evidence for novel multiallelic SNP.

a) IGV plots of Nanopore, PacBio CCS and Illumina reads of the HG002 genome at chr3:5336450-5336480. b) Sanger sequencing signal data for the same region. NanoCaller on the older HG002 data correctly identified the multiallelic SNP at chr3:5336450 (A>T,C), shown in the black box.

To demonstrate the performance of NanoCaller for indel calling, we use Figure 7 to illustrate variants that can be detected by long-read variant callers but cannot be detected from short-read data. In Figure 7, the validated deletion is at chr9:135663795 or chr9:135663804 (there are two correct alignments of the deletion, so both genomic coordinates are correct). NanoCaller detects the deletion at chr9:135663805, while Medaka and Clair detect it at chr9:135663799. Although these calls are several bp away from the expected genomic coordinates (which is normal in long-read based variant calling), they provide accurate information about the deletion, whereas the short-read data show little evidence supporting it, as shown in Figure 7 (a). Sanger sequencing signal data, shown in Figure 7 (b), confirm the presence of a heterozygous deletion at the same location, which causes a frameshift between the signals from the maternal and paternal chromosomes. This example demonstrates how long-read variant callers can detect variants that fail to be reliably called from short-read sequencing data.

Figure 7. Evidence for novel deletions.

a) IGV plots of Nanopore, PacBio CCS and Illumina reads of the HG002 genome at chr9:135663780-chr9:135663850. The 40bp deletion shown in the black box was identified using Sanger sequencing at chr9:135663795 or chr9:135663804 (both are correct; the difference is due to two different alignments). b) Sanger sequencing signal data around the deletion.

NanoCaller Runtime Comparison

We assessed NanoCaller’s running time in four modes: ‘snps_unphased’, ‘snps’, ‘indels’, and ‘both’. In ‘snps_unphased’ mode, NanoCaller uses its deep neural network model to predict SNP calls only, whereas in ‘snps’ mode, SNP calling is followed by an additional step of phasing SNP calls with an external haplotyping tool such as WhatsHap. In ‘indels’ mode, NanoCaller uses phased reads in a BAM input to predict indels only. The entire NanoCaller workflow is the ‘both’ mode, in which NanoCaller first runs in ‘snps’ mode to predict phased SNP calls, then uses WhatsHap to phase reads with those SNP calls, and finally runs ‘indels’ mode on the phased reads. Table 3 shows the wall-clock runtime of each mode of NanoCaller using 16 CPUs (Intel Xeon CPU E5-2683 v4 @ 2.10GHz) on 49x HG002 ONT, 35x PacBio CCS and 58x PacBio CLR reads. NanoCaller takes ~18.4 hours and ~2.8 hours to run the ‘both’ and ‘snps_unphased’ modes on 49x HG002 ONT reads, compared to ~181.6 hours for Medaka and ~5.6 hours for Clair on the same 16 CPUs. On 35x CCS reads, NanoCaller takes ~11.2 hours and ~2.7 hours for the ‘both’ and ‘snps_unphased’ modes, compared to ~1.8 hours by Clair and ~11.8 hours by DeepVariant on 16 CPUs. NanoCaller usually runs faster than other tools. We summarize the runtimes of all variant callers in Supplementary Table S17.

Table 3.

Wall-clock runtime in hours of different modes of NanoCaller using 16 CPUs on 49x ONT, 35x CCS and 58x CLR reads of HG002.

Table 4.

Novel variants in the HG002 genome, missing from the v3.3.2 benchmark variant set, discovered by Sanger sequencing, together with the prediction information by NanoCaller and other variant callers using ONT reads basecalled with Guppy 2.3.4.

Please note that Medaka’s first step also produces unphased SNP calls, using a recurrent neural network on mixed haplotypes (Medaka later uses WhatsHap to phase SNP calls and reads for haplotype-separated variant calling). Compared with Medaka’s first step, NanoCaller’s unphased SNP calling not only takes a fraction of the time (~2.8 hours vs ~70.7 hours), but also gives much better performance (precision, recall and F1-score on HG002: NanoCaller 98%, 97.99%, 97.99% vs Medaka 98.01%, 92.16%, 94.99%). Similarly, Longshot’s first step uses a pair-HMM model to produce SNP calls from mixed haplotypes (Longshot later uses HapCUT2 to update the genotypes of these SNP calls iteratively); on HG002 ONT reads, Longshot’s first step takes 15.2 hours with 93.01% precision, 95.69% recall and 94.33% F1-score. Longshot and WhatsHap cannot use multiple CPUs to produce SNP calls: with a single CPU on 49x HG002 ONT reads, Longshot needs ~49.7 hours and WhatsHap ~84.3 hours for SNP calling.

Effects of Various Parameters on NanoCaller’s Performance

Strategies of choosing heterozygous SNPs for SNP features generation

In NanoCaller, we generate input features for a SNP candidate site by choosing potentially heterozygous SNP sites that share a read with the candidate site. In the implementation, at most 20 heterozygous SNP candidates are chosen downstream and upstream of the candidate site of interest. With the expectation of roughly 1 SNP per 1000bp, a simple approach is to consider the 20kb of sequence downstream and upstream of the candidate site and select the 20 nearest heterozygous SNP sites on each side. However, in some small genomic regions, a dense cluster of heterozygous SNP candidates may appear due to noise; these false SNPs provide strong co-occurrence evidence for each other but little information about the candidate site of interest. Longshot addresses this issue by simply removing dense clusters whenever the number of SNP calls exceeds a threshold within a specified range; however, such hard limits can miss true SNPs that do occur in dense clusters in certain genomic regions.

We instead use a different method for selecting nearby potentially heterozygous sites, forcing NanoCaller to pick a certain number of sites at increasing distances from the candidate site. More precisely, we force NanoCaller to pick 2, 3, 4, 5 and 6 heterozygous SNP sites between the following distances from the candidate site: 0, 2kbp, 5kbp, 10kbp, 20kbp and 50kbp. This is illustrated in Figure S3 and Table S18 of Supplementary Materials. With this method we achieve better SNP calling performance on ONT reads: Supplementary Materials Table S21 shows that, for each genome in the Ashkenazim trio, we achieve higher precision, recall and F1 score for whole-genome analysis as well as in each set of difficult-to-map genomic regions. On the other hand, SNP calling performance on PacBio CCS and CLR reads is not affected by this method of selecting heterozygous SNP sites, as shown in Supplementary Materials Table S22. This might be because ONT reads have significantly higher N50 and mean read lengths than PacBio CCS and CLR reads, as shown in Supplementary Materials Table S1. Supplementary Materials Figure S2 shows the read length distributions of the HG004 ONT, CCS and CLR datasets at 88X, 35X and 27X coverage, respectively. In these datasets, 99.4% of CCS reads and 97% of CLR reads are shorter than 20,000bp, whereas only 69.3% and 87.4% of ONT reads are shorter than 20,000bp and 50,000bp, respectively. This comparison demonstrates that NanoCaller utilizes longer reads to improve SNP calling, and the comparatively shorter PacBio reads may in part explain the smaller improvement there. Thus, as read lengths increase, we expect NanoCaller's performance to improve further.
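The distance-bucket strategy described above can be sketched in Python as follows. The bucket boundaries and per-bucket counts (2, 3, 4, 5 and 6 sites within 0-2kbp, 2-5kbp, 5-10kbp, 10-20kbp and 20-50kbp) follow the text; the function itself is an illustrative simplification, not NanoCaller's actual code:

```python
# Distance buckets: (lower bound, upper bound, number of sites to pick).
BUCKETS = [(0, 2_000, 2), (2_000, 5_000, 3), (5_000, 10_000, 4),
           (10_000, 20_000, 5), (20_000, 50_000, 6)]

def pick_het_sites(candidate_pos, het_positions, downstream=True):
    """Pick up to 20 heterozygous sites on one side of candidate_pos,
    a fixed number from each distance bucket (illustrative sketch)."""
    chosen = []
    for lo, hi, k in BUCKETS:
        in_bucket = [p for p in het_positions
                     if lo <= abs(p - candidate_pos) < hi
                     and ((p > candidate_pos) == downstream)]
        # Prefer the nearest sites within each bucket.
        in_bucket.sort(key=lambda p: abs(p - candidate_pos))
        chosen.extend(in_bucket[:k])
    return chosen
```

Running this once for the downstream direction and once for the upstream direction yields the 40 flanking sites; buckets with too few sites simply contribute fewer entries, matching the zero-padding behavior of the image.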

Different thresholds for heterozygous SNP sites for SNP features generation

In order to generate haplotype structure features from long reads for a SNP candidate site, we need to select potentially heterozygous sites. Ideally, heterozygous sites should have an alternative allele frequency of approximately 0.5, which is rarely the case due to alignment and sequencing errors. Therefore, a SNP candidate site is deemed potentially heterozygous if its alternative allele frequency lies in a small range centered at 0.5, called the neighbor threshold: typically 0.4-0.6 or 0.3-0.7, depending upon the sequencing technology. Table S19 of Supplementary Materials shows how the choice of this threshold affects SNP calling performance for ONT, PacBio CCS and CLR reads, which have different error rates and read lengths (which in turn determine the number of candidate sites). In general, a narrower range around 0.5 yields higher precision, but recall decreases because too few heterozygous sites are chosen to provide informative features. In particular, performance is very sensitive to the upper limit of the threshold, and decreases drastically when it is raised. We determined that the 0.4-0.6 threshold works best for ONT reads, with 0.3-0.7 and 0.3-0.6 being the best thresholds for CCS and CLR reads, respectively. Using a narrower threshold for ONT reads makes sense because longer ONT reads provide many more heterozygous sites to choose from than CLR or CCS reads. Note that this threshold is used only when testing on sequencing data; during training of NanoCaller SNP models, we simply use benchmark heterozygous SNPs.

We also checked how the minimum number of heterozygous SNP candidates required by NanoCaller affects performance; the results are shown in Table S20 of Supplementary Materials. In NanoCaller, a SNP candidate with fewer than the minimum number of heterozygous SNP candidates is considered a false negative without prediction. By default, this minimum is 1. Table S20 examines different minimum numbers on both Nanopore and PacBio reads. On both datasets, as this minimum threshold increases, precision increases and recall decreases. The increasing precision suggests that more heterozygous SNP candidates benefit SNP prediction.

Using WhatsHap ‘distrust genotype’ option for phasing

WhatsHap is able to call SNPs on ONT and PacBio reads, as shown in Table 2. On PacBio reads, WhatsHap shows performance similar to NanoCaller, albeit much more slowly, while on ONT reads WhatsHap performs poorly, with F1-scores around 88-93%. In NanoCaller, WhatsHap is used for phasing SNPs and reads but not for variant calling. WhatsHap also offers a ‘distrust genotypes’ setting for phasing, which allows it to change the genotype of any SNP from heterozygous to homozygous and vice versa in an optimal-likelihood solution based upon the haplotypes created. In NanoCaller, this setting is disabled by default.

If users of NanoCaller want to use the ‘distrust genotypes’ setting in WhatsHap, a negligible effect on SNP calling performance is expected on ONT reads, while increases in F1-score of 0.15-0.5% and 0.17-1.4% are expected on PacBio CCS and CLR reads, respectively, as shown in Table S23 of Supplementary Materials. Note, however, that using this setting significantly increases the runtime.

Discussion

In this study, we present NanoCaller, a deep learning framework to detect SNPs and small indels from long-read sequencing data. Depending on library preparation and sequencing techniques, long-read data usually have much higher error rates than short-read sequencing data, which poses a significant challenge to variant calling and thus stimulates the development of error-tolerant deep learning methods for accurate variant calling. However, the benefits of the much longer read length of long-read sequencing were not fully exploited for variant calling in previous studies. The NanoCaller tool that we present here integrates long-range haplotype structure into a deep convolutional neural network to detect SNPs from long-read sequencing data, and uses multiple sequence alignment to realign indel candidate sites for indel calling. Our evaluations under cross-genome testing, cross-reference-genome testing and cross-platform testing demonstrate that NanoCaller performs competitively against other long-read variant callers, and outperforms other methods in difficult-to-map genomic regions.

NanoCaller has several advantages for calling variants from long-read sequencing data. (1) NanoCaller uses pileups of candidate SNPs from a haplotyped set of long-range heterozygous SNPs (hundreds or thousands of bp away, rather than in the adjacent local neighborhood of the candidate SNP of interest), each of which shares a long read with the candidate site. For a long read of >20kb, there are on average >20 heterozygous sites, and evidence of SNPs from the same long reads can thus improve SNP calling by deep learning. Evaluated on several human genomes with benchmark variant sets, NanoCaller demonstrates competitive performance against existing variant calling methods on long reads and with phased SNPs. (2) NanoCaller is able to make accurate predictions across sequencing platforms and across reference genomes. In this study, we tested NanoCaller models trained on Nanopore data for calling variants on PacBio long-read data and achieved similar performance, and we tested models trained on GRCh38 for calling variants against GRCh37 and achieved the same level of SNP calling performance. (3) With the advantage of long-read data in repetitive regions, NanoCaller is able to detect SNPs/indels outside high-confidence regions that cannot be reliably detected by short-read sequencing techniques, and thus provides more candidate SNP/indel sites for investigating causal variants in undiagnosed diseases where no disease-causal candidate variants were found by short-read sequencing. (4) NanoCaller has a flexible design to call multi-allelic variants, which Clairvoyante and Longshot cannot handle. In NanoCaller, the probability of each nucleotide type is assessed separately, and it is possible for 2, 3 or 4 nucleotide types to have probabilities larger than 0.5 (or even close to 1.0), suggesting strong evidence for multiple bases at a specific position in a test genome.
Therefore, NanoCaller can easily generate multi-allelic variant calls, where all alternative alleles differ from the reference allele. Furthermore, NanoCaller can easily be configured to call variants in species with polyploidy, or somatic mosaic variants, when data are available to train an accurate model. (5) NanoCaller uses rescaled statistics to generate the pileup for a candidate site; these rescaled statistics are independent of the coverage of a test genome, so NanoCaller can handle a test dataset whose coverage differs from that of the training dataset, which might be a challenge for other long-read callers. That is, NanoCaller trained on whole-genome data has less bias on other datasets with much lower or higher coverage, such as targeted sequencing data with thousands-fold coverage. (6) With very accurate HiFi reads (<1% error rate) generated by PacBio, NanoCaller yields competitive variant calling performance.

However, there are several limitations of NanoCaller that we wish to discuss here. One is that NanoCaller relies on accurate alignment and pileup of long-read sequencing data, and incorrect alignments in low-complexity regions may still occur, complicating the variant calling process. For instance, most variants missed by NanoCaller in the MHC region cannot be observed through IGV either, due to alignment errors. Continually improving sequencing techniques and alignment tools will give NanoCaller better performance, but if the data target very complicated regions or are aligned with very poor mapping quality, NanoCaller's performance will be affected. Another limitation is that indel detection in mononucleotide repeats might not be accurate, especially on Nanopore long-read data, where basecalling of homopolymers is difficult [31,32]. In the Nanopore basecalling process, it is challenging to determine how many repeated nucleotides a long consecutive run of similar signals represents, potentially resulting in false indel calls in these regions, which can be removed from the call set by post-processing.

In summary, we propose a deep-learning tool that solely uses long-range haplotype information for SNP calling and local multiple sequence alignment for accurate indel calling. Our evaluation on several human genomes suggests that NanoCaller performs competitively against other long-read variant callers and can generate SNP/indel calls in complex genomic regions. NanoCaller enables the detection of genetic variants from genomic regions that were previously inaccessible to genome sequencing, and may facilitate the use of long-read sequencing in finding disease variants in human genetic studies.

Methods

Datasets

Long-read data

Five long-read datasets of human genomes are used for the evaluation of NanoCaller: HG001, the Ashkenazim trio (son HG002, father HG003 and mother HG004), and HX1. For HG001, Oxford Nanopore Technologies (ONT) rel6 FASTQ files, basecalled with Guppy 2.3.8, are downloaded from the WGS consortium database [9] and aligned to the GRCh38 reference genome using minimap2 [33]; PacBio CCS alignment files for HG001 are downloaded from the GIAB database [30, 34]. For HG002, HG003 and HG004 in the Ashkenazim trio, newly released sets of ONT reads (basecalled by Guppy 3.6) and PacBio CCS reads are obtained from NIST through the precisionFDA challenge website, and aligned to the GRCh38 reference genome using minimap2 [33]. Alignment files for an older dataset of ONT reads for HG002, basecalled by Guppy 2.3.5, are downloaded from the GIAB database [30, 34] for analysis. PacBio CLR alignment files for the Ashkenazim trio are downloaded from the GIAB database [30, 34]. The fifth genome, HX1, was sequenced by us using PacBio [10] and Nanopore sequencing [35]; the long-read data are aligned to the GRCh38 reference genome using minimap2 [33]. Supplementary Table S1 shows the statistics of mapped reads for the five genomes, where the coverage of the ONT data ranges from 43 to 91 and the coverage of the PacBio data is between 27 and 58.

Benchmark variant calls

The benchmark sets of SNPs and indels for HG001 (version 3.3.2) and the Ashkenazim trio (v4.2 and v3.3.2) are downloaded from the Genome in a Bottle (GIAB) Consortium [30], together with high-confidence regions for each genome. There are 3,004,071; 3,459,843; 3,430,611; and 3,454,689 SNPs for HG001, HG002, HG003 and HG004 respectively, and 516,524; 587,978; 569,180 and 576,301 indels, as shown in Table 1. Benchmark variant calls for HX1 were generated by running GATK on ~300X Illumina reads sequenced by us [10] (Table 1).

NanoCaller framework for variant calling

In the framework of NanoCaller for variant calling, candidate sites of SNPs and indels are defined according to an input alignment and a reference sequence. NanoCaller has two convolutional neural networks, one for SNP calling and the other for indel prediction, each requiring a different type of input. Input pileup images generated for SNP candidate sites only use long-range haplotype information. For indel candidate sites, alignment reads are phased with WhatsHap using SNP calls from NanoCaller, and then NanoCaller uses phased alignment reads to generate input pileup images by carrying out local multiple sequence alignment around each site. Afterwards, NanoCaller combines SNP and indel calls to give a final output. The details are described below.

SNP Calling in NanoCaller

There are four steps in NanoCaller to generate SNP calling result for an input genome: candidate site selection, pileup image generation of haplotype SNPs, deep learning prediction, and phasing of SNP calls.

Candidate site selection

Candidate sites of SNPs are defined according to the depth and alternative allele frequency at a specific genomic position. In NanoCaller, “SAMtools mpileup” [36] is used to generate aligned bases at each genomic position, and SNP candidate sites are determined using the criteria below. For a genomic site b with reference base R,

  1. the alternative allele frequency of b is calculated as (number of reads supporting a base other than R at b) / (total read depth at b);

  2. b is considered a SNP candidate site if the total read depth and the alternative allele frequency are both greater than specified thresholds. We set the alternative allele frequency threshold to be 15%.
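The criteria above can be sketched in a few lines of Python. The function name, the counts input format and the default depth threshold are illustrative assumptions (NanoCaller exposes the depth threshold as a parameter); only the 15% allele frequency threshold comes from the text:

```python
def is_snp_candidate(base_counts, ref_base, min_depth=10, min_alt_freq=0.15):
    """base_counts: dict mapping 'A','C','G','T' -> read count at this site.
    Returns True if the site passes both candidate criteria."""
    depth = sum(base_counts.values())
    if depth == 0:
        return False
    # Alternative allele frequency: fraction of reads supporting a non-reference base.
    alt_freq = (depth - base_counts.get(ref_base, 0)) / depth
    return depth >= min_depth and alt_freq > min_alt_freq
```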

Pileup image generation

After selecting all SNP candidate sites above, we determine a subset of SNP candidate sites as the set of highly likely heterozygous SNP sites (denoted by V). We extract long-range haplotype information from this subset of likely heterozygous SNP sites to create input images for all SNP candidate sites to be used in a convolutional neural network. This subset consists of SNP candidate sites with alternative allele frequencies in a specified range around 50%, and the default range is 40% to 60% for heterozygous site filtering. This range can be specified by the user depending upon the sequencing technology and read lengths. In detail, the procedure of pileup image generation is described below (as shown in Figure 2). For a SNP candidate site b:

  1. We select sites from the set V that share at least one read with b and are at most 50,000bp away from b. For SNP calling on PacBio datasets, we set this limit at 20,000bp.

  2. In each direction, upstream or downstream, of the site b, we choose 20 sites from V. If there are fewer than 20 such sites, we pad the final image with zeros. We denote the set of these potential heterozygous SNP sites near b (including b) by Z. An example is shown in Figure 2 (a). More details on how these 40 nearby heterozygous sites are chosen from the set V can be found in Supplementary Materials Tables S18–S22 and Figures S2-S3.

  3. The set of reads covering b is divided into four groups, RB = {reads that support base B at b}, B ∈ {A,G,T,C}. Reads that do not support any base at b are not used.

  4. For each read group RB with supporting base B, we count the number C(B,t,D) of reads supporting base D ∈ {A, G, T, C} at site t ∈ Z (as shown in Figure 2(b)).

  5. Let M(B,t,D) = g(D) · C(B,t,D) / Σ_{D′∈{A,G,T,C}} C(B,t,D′), where g(D) is a function that returns −1 if D is the reference base at site t and 1 otherwise; the counts are thus rescaled to frequencies within each read group. An example is shown in Figure 2(c).

  6. We obtain a 4×41×4 matrix M with entries M(B,t,D) (as shown in Figure 2(d)), where the first dimension corresponds to the nucleotide type B at site b, the second dimension to the sites t, and the third dimension to the nucleotide type D at site t. Our image has read groups as rows, base positions as columns, and 4 channels, each recording frequencies of a different base in the given read group at the given site.

  7. We add another channel to our image: a 4×41 matrix I with entries I(B,t) = 1 if B is the reference base at site b and 0 otherwise (as shown in Figure 2(d)). In this channel, we have a row of ones for the reference base at b and rows of zeroes for the other bases.

  8. We add another row to the image which encodes the reference bases of the sites in Z; the final image is illustrated in Figure 2(e).
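Steps 3-7 above (omitting the extra reference-base row of step 8) can be sketched as follows. The data structures and the rescaling of counts within each read group are assumptions of this sketch, not NanoCaller's actual implementation:

```python
import numpy as np

BASES = "ACGT"

def snp_pileup_image(read_groups, het_sites, ref_at, b):
    """Build a simplified SNP input image for candidate site b.

    read_groups: dict mapping base B -> list of reads in group R_B;
                 each read is a dict {site: observed base}.
    het_sites:   ordered list of the (up to 41) sites in Z, including b.
    ref_at:      dict mapping site -> reference base.
    Returns a 4 x len(het_sites) x 5 array (4 base channels plus the
    reference channel of step 7).
    """
    n = len(het_sites)
    image = np.zeros((4, n, 5))
    for i, B in enumerate(BASES):
        group = read_groups.get(B, [])
        for j, t in enumerate(het_sites):
            # Step 4: count reads in R_B supporting each base D at site t.
            counts = np.array([sum(1 for r in group if r.get(t) == D)
                               for D in BASES], dtype=float)
            total = counts.sum()
            if total > 0:
                # Step 5: rescale to frequencies; g(D) flips the sign of
                # the reference base at t.
                g = np.array([-1.0 if D == ref_at[t] else 1.0 for D in BASES])
                image[i, j, :4] = g * counts / total
        # Step 7: channel marking the row of the reference base at b.
        image[i, :, 4] = 1.0 if B == ref_at[b] else 0.0
    return image
```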

Deep learning prediction

In NanoCaller, we present a convolutional neural network [37] for SNP prediction, as shown in Figure 1. The network has three convolutional layers: the first layer uses kernels of three different dimensions and combines the convolved features into a single output (one kernel captures local information from a row, another from a column, and the third from a 2D local region); the second and third layers use kernels of size 2×3. The output from the third convolutional layer is flattened and used as input to a fully connected network with dropout (rate 0.5). The first fully connected layer is followed by two different networks of fully connected layers that calculate two types of probabilities. The first network calculates, for each nucleotide type B, the probability that B is present at the genomic candidate site; thus each nucleotide type B has a binary label prediction. The second network combines the logit output of the first network with the output of the first fully connected hidden layer to estimate the probability of each zygosity (homozygous or heterozygous) at the candidate site. The second network is used only during training, to propagate errors backwards for incorrect zygosity predictions; during testing, we infer zygosity from the output of the first network only.

To call SNPs for a test genome, NanoCaller calculates the probability of the presence of each nucleotide type at the candidate site. If a candidate site has at least two nucleotide types with probabilities exceeding 0.5, it is considered heterozygous; otherwise it is regarded as homozygous. For heterozygous sites, the two nucleotide types with the highest probabilities are chosen for a heterozygous variant call. For homozygous sites, only the nucleotide type with the highest probability is chosen: if that nucleotide type is not the reference allele, a homozygous variant call is made; otherwise the site is homozygous reference. Each called variant is also assigned a quality score, calculated as −100 log10(1 − P(B)), where P(B) is the probability of the alternative allele B (in the case of a multiallelic prediction we choose B to be the alternative allele with the smaller probability); this score is recorded as a float in the QUAL field of the VCF file to indicate the chance of a false positive prediction: the larger the score, the less likely the prediction is wrong.
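The decision rule and quality score described above can be sketched as follows; the function and label names are illustrative assumptions, while the thresholds and the QUAL formula follow the text:

```python
import math

def call_genotype(probs, ref_base):
    """probs: dict mapping base -> probability that the base is present.
    Returns (call type, alleles, QUAL) per the decision rule above
    (illustrative sketch)."""
    present = sorted([b for b, p in probs.items() if p > 0.5],
                     key=lambda b: probs[b], reverse=True)
    if len(present) >= 2:                   # heterozygous: top two bases
        alleles = present[:2]
    elif present:                           # homozygous: single top base
        alleles = [present[0], present[0]]
    else:
        return None                         # no confident call
    if set(alleles) == {ref_base}:
        return ('hom-ref', alleles, None)
    # QUAL uses the alternative allele; for multiallelic calls, the one
    # with the smaller probability.
    p = min(probs[a] for a in set(alleles) if a != ref_base)
    qual = -100 * math.log10(max(1 - p, 1e-10))
    kind = 'het' if alleles[0] != alleles[1] else 'hom-alt'
    return (kind, alleles, qual)
```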

Phasing of SNP calls

After SNP calling, NanoCaller phases the predicted SNP calls using WhatsHap [38]. By default, NanoCaller disables the ‘distrust-genotypes’ and ‘include-homozygous’ settings of WhatsHap for phasing SNP calls, which would otherwise allow WhatsHap to switch variants from heterozygous to homozygous and vice versa in an optimal phasing solution. Enabling these WhatsHap settings has minimal impact on NanoCaller’s SNP calling performance (as shown in Supplementary Tables S2, S3 and S4), but increases the time required for phasing by 50-80%. NanoCaller outputs both the unphased VCF file generated by NanoCaller and the phased VCF file generated by WhatsHap.

Indel Calling in NanoCaller

Indel calling in NanoCaller takes a genome with phased reads as input and uses the steps below to generate indel predictions: candidate site selection, pileup image generation, deep learning prediction, and indel sequence determination. In NanoCaller, long reads are phased with SNP calls that are predicted by NanoCaller and phased by WhatsHap [38] (as described above).

Candidate site selection

Indel candidate sites are determined using the criteria below. For a genomic site b,

  1. Calculate:

    1. For i ∈ {0,1}, depthi = total number of reads in phase i at site b

    2. insertion frequency in phase i = (number of reads in phase i supporting an insertion at site b) / depthi

    3. deletion frequency in phase i = (number of reads in phase i supporting a deletion at site b) / depthi

  2. b is considered an indel candidate site if:

    1. Both depth0 and depth1 are greater than a specified depth threshold

    2. Either insertion frequency is greater than a specified insertion frequency threshold or the deletion frequency is greater than a specified deletion frequency threshold.

Thresholds for alternative allele frequency, insertion frequency, deletion frequency and read depths can be specified by the user depending on coverage and base calling error rate of the genome sequencing data.
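The indel candidate criteria can be sketched as follows. The default threshold values and the input format here are placeholders (as noted above, NanoCaller exposes these thresholds as user-specified parameters):

```python
def is_indel_candidate(phase_reads, min_depth=10, ins_thresh=0.4, del_thresh=0.6):
    """phase_reads: for each phase i in {0, 1}, a list of reads at site b,
    each flagged with whether it carries an insertion/deletion there.
    Returns True if b passes the candidate criteria (illustrative sketch)."""
    # Criterion 1: both phase depths must exceed the depth threshold.
    for reads in phase_reads:
        if len(reads) < min_depth:
            return False
    # Criterion 2: insertion or deletion frequency exceeds its threshold.
    for reads in phase_reads:
        ins_freq = sum(r['has_ins'] for r in reads) / len(reads)
        del_freq = sum(r['has_del'] for r in reads) / len(reads)
        if ins_freq > ins_thresh or del_freq > del_thresh:
            return True
    return False
```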

Pileup image generation

Input image of indel candidate site is generated using the procedure below as shown in Figure 3. For an indel candidate site b:

  1. Denote by Sall, Sphase1 and Sphase2 the set of all reads, reads in a phase and reads in the other phase at site b, respectively.

  2. Let seqref be the reference sequence of length 160bp starting at site b

  3. For each set S ∈ {Sall, Sphase1, Sphase2}, do the following:

    1. For each read r ∈ S, let seqr be the 160bp long subsequence of the read starting at the site b (for PacBio datasets, we use reference sequence and alignment sequences of length 260bp).

    2. Use MUSCLE to carry out multiple sequence alignment of the following set of sequences {seqref} ∪ {seqr}r∈S as shown in Figure 3 a).

    3. Let {seq′ref} ∪ {seq′r}r∈S be the realigned sequences, where seq′ref denotes the realigned reference sequence, and seq′r denotes realignment of sequence seqr. We truncate all sequences at the length 128 from the end.

    4. For B ∈ {A, G, T, C, −} and 1 ≤ p ≤ 128, calculate C(B,p) = Σ_{r∈S} 1_{B,p}(seq′r), where 1_{B,p}(seq′r) returns 1 if the base at index p of seq′r is B and 0 otherwise. Figure 3 c) shows the raw counts C(B,p) for each symbol.

    5. Let M be the 5 × 128 matrix with entries M(B,p) obtained by rescaling the counts C(B,p), as shown in Figure 3 d).

    6. Construct a 5 × 128 matrix Q with entries Q(B,p), where Q(B,p) = 1 if seq′ref has symbol B at index p and 0 otherwise, as shown in Figure 3 f). Both matrices M and Q have their first dimension corresponding to the symbols {A, G, T, C, −} and their second dimension corresponding to pileup columns of the realigned sequences.

    7. Construct a 5 × 128 × 2 matrix MatS whose first channel is the matrix M – Q as shown in Figure 3 e) and the second channel is the matrix Q.

  4. Concatenate the three matrices MatSall, MatSphase1 and MatSphase2 along the first dimension to obtain a 15 × 128 × 2 matrix as input to the convolutional neural network.
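The matrix construction in steps 4-7 can be sketched as follows. This is a hypothetical helper, not NanoCaller's implementation, and the column normalization used to derive M from the raw counts is our reading of the distinction between Figure 3 c) and d):

```python
import numpy as np

SYMBOLS = "AGTC-"  # row order follows the text: A, G, T, C, gap

def pileup_matrix(ref_aln, read_alns, width=128):
    """Sketch of steps 4-7: count symbols per pileup column (C),
    column-normalize to frequencies (M), one-hot encode the realigned
    reference (Q), and stack M - Q and Q as two channels."""
    C = np.zeros((5, width))
    for seq in read_alns:
        for p, base in enumerate(seq[:width]):
            C[SYMBOLS.index(base), p] += 1
    col_tot = C.sum(axis=0, keepdims=True)
    # Safe division: columns with no coverage stay zero.
    M = np.divide(C, col_tot, out=np.zeros_like(C), where=col_tot > 0)
    Q = np.zeros((5, width))
    for p, base in enumerate(ref_aln[:width]):
        Q[SYMBOLS.index(base), p] = 1
    return np.stack([M - Q, Q], axis=-1)  # shape (5, width, 2)
```

Concatenating the matrices returned for Sall, Sphase1 and Sphase2 along the first axis then yields the 15 × 128 × 2 network input.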

Deep learning prediction

In NanoCaller, we present another convolutional neural network [37] for indel calling, as shown in Figure 1. This network has a structure similar to the SNP calling network; the difference lies in the fully connected layers: in the indel model, the first fully connected layer is followed by two fully connected hidden layers that produce probability estimates for each of the four zygosity cases at the candidate site: homozygous-reference, homozygous-alternative, heterozygous-reference and heterozygous-alternative (i.e. heterozygous with no reference allele).

Indel sequence determination

NanoCaller then calculates the probabilities of the four zygosity cases: homozygous-reference, homozygous-alternative, heterozygous-reference and heterozygous-alternative. No variant call is made if the homozygous-reference label has the highest probability. If the homozygous-alternative label has the highest probability, we determine the consensus sequence from the multiple sequence alignment of Sall and align it against the reference sequence at the candidate site using BioPython's pairwise2 local alignment algorithm with an affine gap penalty. The alternative allele is inferred from the indel in the pairwise alignment of the two sequences. If either of the heterozygous predictions has the highest probability, we use Sphase1 and Sphase2 to determine consensus sequences for each phase separately and align them against the reference sequence. Indel calls from both phases are combined to make a final phased indel call at the candidate site.
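The allele-inference step can be illustrated with a small helper that reads the indel off an already-computed pairwise alignment (NanoCaller obtains that alignment with BioPython's pairwise2; `indel_from_alignment` below is a hypothetical sketch that assumes a single clean indel that is not at the very start of the alignment):

```python
def indel_from_alignment(aln_consensus, aln_reference):
    """Given a gapped pairwise alignment of consensus vs. reference,
    return (ref_position, ref_allele, alt_allele) for the first indel,
    VCF-style (anchored on the preceding reference base). Gaps in the
    reference row are insertions; gaps in the consensus row, deletions."""
    ref_pos = 0  # reference bases consumed so far
    for i, (c, r) in enumerate(zip(aln_consensus, aln_reference)):
        if c == "-" or r == "-":
            # Walk to the end of the contiguous gap run.
            j = i
            while j < len(aln_consensus) and (
                    aln_consensus[j] == "-" or aln_reference[j] == "-"):
                j += 1
            anchor = aln_reference[i - 1]
            if r == "-":  # insertion relative to the reference
                return ref_pos - 1, anchor, anchor + aln_consensus[i:j]
            return ref_pos - 1, anchor + aln_reference[i:j], anchor
        ref_pos += 1
    return None  # no indel in the alignment
```

For example, aligning consensus `ACGTA` to reference `AC-TA` reports an insertion of `G` anchored on the `C` at reference position 1.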

Training and testing

For SNP calling, we have trained three convolutional neural network models on three different genomes that users can choose from. NanoCaller1 and NanoCaller2 are models trained on ONT reads of chr1-22 of HG001 (basecalled by Guppy v2.3.8) and HG002 (basecalled by Guppy v2.3.5), respectively. NanoCaller3 is a model trained on chr1-22 of HG003 PacBio continuous long reads. For indel calling, we have trained two models on chr1-22 of HG001 using ONT (basecalled by Guppy v2.3.8) and PacBio CCS reads; NanoCaller uses the former model when calling indels on ONT datasets and the latter on PacBio datasets. All training sequencing datasets were aligned to GRCh38, and GIAB's v3.3.2 benchmark variants were used for training all SNP and indel models.

In NanoCaller, the SNP and indel models have 137,678 parameters in total, a significantly lower number than Clair [24] (2,377,818) and Clairvoyante [23] (1,631,496). All parameters in NanoCaller are initialized by Xavier's method [39]. Each model was trained for 100 epochs, using learning rates of 1e-3 and 1e-4 for the SNP and indel models, respectively. We also applied L2-norm regularization, with coefficients of 1e-3 and 1e-4 for the SNP and indel models, respectively, to prevent overfitting.

When NanoCaller is applied to a test genome, its coverage may well differ from that of the genome used for training. To reduce the bias caused by this coverage difference, after generating pileup images for SNP calling, NanoCaller by default scales the raw counts of bases in pileup images to adjust for the difference between the coverages of the testing genome and the genome used to train the model selected by the user, i.e. we replace the counts CB,p shown in Figure 2 (b) by CB,p × (training genome coverage / testing genome coverage).
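A sketch of this scaling, under the assumption (reconstructed from the text) that counts are multiplied by the ratio of training coverage to testing coverage; `scale_pileup_counts` is a hypothetical helper:

```python
def scale_pileup_counts(counts, train_coverage, test_coverage):
    """Multiply every raw base count C_{B,p} in a pileup matrix
    (rows = symbols, columns = positions) by the train/test coverage
    ratio, so a genome sequenced at a different depth than the training
    genome yields comparable features."""
    ratio = train_coverage / test_coverage
    return [[c * ratio for c in row] for row in counts]
```

For example, counts from a 20X test genome are doubled before being fed to a model trained on a 40X genome.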

Performance measurement

The performance of SNP/indel calling by a variant caller is evaluated against the benchmark variant sets. Several performance measurements are used, namely precision (p), recall (r) and F1 score, defined as p = TP/(TP + FP), r = TP/(TP + FN), and F1 = 2pr/(p + r), where TP is the number of benchmark variants correctly predicted by a variant caller, FP is the number of called variants that are not in the benchmark variant sets, and FN is the number of benchmark variants that the variant caller fails to call. F1 is the harmonic mean of p and r. The range of all three measurements is [0, 1]: the larger, the better.
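These measurements follow directly from the TP/FP/FN counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard scoring of variant calls against a benchmark set:
    p = TP/(TP+FP), r = TP/(TP+FN), F1 = 2pr/(p+r)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```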

Sanger validation of selected sites on HG002

To further demonstrate the performance of NanoCaller and other variant callers, we selected 17 genomic regions whose SNPs/indels are not in the GIAB ground truth calls (version 3.3.2), and conducted Sanger sequencing for them on HG002. First, we designed PCR primers within ~400 bp of each selected site of interest and used a high-fidelity PCR enzyme (PrimeSTAR GXL DNA Polymerase, TaKaRa) to amplify each of the selected target repeat regions. The PCR products were purified using AMPure XP beads and sequenced by Sanger sequencing. We then deciphered two sequences from the Sanger results for variant analysis. The data and deciphered sequences are in the supplementary files. Note that more than 17 variant sites were detected in the Sanger results, because each PCR region can contain more than one variant.

Competing Interests

The authors declare no competing interests.

Author contributions

UA and QL developed the computational method and drafted the manuscript. UA implemented the software tool and evaluated its performance. LF conducted wet-lab experiments of Sanger sequencing of candidate variants. KW conceived the study, advised on model design and guided implementation/evaluation. All authors read, revised, and approved the manuscript.

Supplementary Materials

Statistics of long-reads datasets

Table S5.

Whole-genome statistics of five human genome datasets sequenced by Nanopore and PacBio. Each genome is aligned to the GRCh38 reference genome, and only the mapped reads were used to calculate the statistics. The total number of bases is calculated as the sum of lengths of all mapped reads, and coverage is defined as the number of mapped bases divided by the reference genome length. 'Mean' and 'Median' represent the mean and median read length of long reads for a genome.

Performance of NanoCaller and other variant callers on old Nanopore datasets of the Ashkenazim trio

Table S6.

Performances of SNP predictions of NanoCaller1 and NanoCaller2 SNP models, along with existing variant callers, on old ONT data of the Ashkenazim trio, evaluated against v3.3.2 of GIAB benchmark variants.

Table S7.

Performances of SNP predictions of NanoCaller1 and NanoCaller2 SNP models, along with existing variant callers, on old ONT data of the Ashkenazim trio, evaluated against v4.2 of GIAB benchmark variants.

Table S8.

Performances of indel predictions in non-homopolymer regions of NanoCaller, along with existing variant callers, on old ONT data of the Ashkenazim trio, evaluated against v3.3.2 of GIAB benchmark variants.

Table S9.

Performances of indel predictions in non-homopolymer regions of NanoCaller, along with existing variant callers, on old ONT data of the Ashkenazim trio, evaluated against v4.2 of GIAB benchmark variants.

SNP performance of NanoCaller and other variant callers in difficult-to-map regions on Nanopore reads of the Ashkenazim trio

BED file sources of the different difficult-to-map regions with respect to the GRCh38 reference genome:

  1. All difficult regions in Table 3 of the NanoCaller manuscript.

    Source: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/union/GRCh38_alldifficultregions.bed.gz

  2. Low mappability regions in Table S12:

    Source: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/mappability/GRCh38_lowmappabilityall.bed.gz

  3. Segmental duplications in Table S13.

    Source: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/SegmentalDuplications/GRCh38_segdups.bed.gz

  4. Tandem and homopolymer repeats in Table S14.

    Source: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/LowComplexity/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz

  5. Major Histocompatibility Complex (chr6:28510020-33480577) in Table S15.

    Source: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/OtherDifficult/GRCh38_MHC.bed.gz

Table S10.

Statistics of the number of ground truth variants in the GIAB v4.2 benchmark of the Ashkenazim trio within various difficult genomic regions identified in GIAB genome stratification v2.0. Numbers shown below are for SNPs, except for the MHC region, for which both SNP and indel counts are shown.

Table S11.

Performances of SNP predictions in all difficult genomic regions by the NanoCaller1 SNP model, along with existing variant callers on ONT data, using v4.2 benchmark variants.

Table S12.

Performances of SNP predictions in low mappability regions by the NanoCaller1 SNP model, along with existing variant callers on ONT data, using v4.2 benchmark variants.

Table S13.

Performances of SNP predictions in segmental duplication regions by the NanoCaller1 SNP model, along with existing variant callers on ONT data, using v4.2 benchmark variants.

Table S14.

Performances of SNP predictions in tandem and homopolymer repeat regions by the NanoCaller1 SNP model, along with existing variant callers on ONT data, using v4.2 benchmark variants.

Table S15.

Performances of SNP, indel and overall variant predictions in the Major Histocompatibility Complex by the NanoCaller1 model, along with existing variant callers on ONT data, using v4.2 benchmark variants. Overall variant performance is calculated by combining the SNP and indel performances shown.

Novel variants validated by Sanger sequencing

Table S16.

Sanger-validated variants, and performance of predictions in the HG002 genome by various variant callers using new ONT reads. These variants are missing from the GIAB v3.3.2 benchmark of HG002.

Quality score thresholds for filtering variant calls

We provide precision/recall/F1 statistics for each variant caller with respect to the recommended quality score, or by averaging over the quality scores that give the highest F1-score when evaluated by RTG's vcfeval. Here we show the quality thresholds we used for different NanoCaller models on various datasets, to illustrate how read quality, depth and the quality of ground truth variants affect the optimal quality score. The range of SNP quality scores is 30-999. Table S17 shows that the SNP F1-scores of both the NanoCaller1 and NanoCaller2 SNP models are very resilient to small changes in the quality score threshold. For the NanoCaller1 and NanoCaller3 SNP models, we get the best performance on PacBio CCS reads when we do not use any threshold, which corresponds to a quality score threshold of 30. For PacBio CLR reads, genome coverage significantly affects SNP calling performance, so we choose different thresholds for HG001/HG002 and HG003/HG004, as shown in Table S18.
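Applying such a quality cut-off amounts to dropping VCF records whose QUAL value falls below the chosen threshold; a minimal sketch over plain-text VCF lines (`filter_vcf_by_qual` is a hypothetical helper, not part of NanoCaller):

```python
def filter_vcf_by_qual(lines, threshold):
    """Keep all VCF header lines (starting with '#') and only those
    records whose QUAL field (column 6 of a tab-separated record)
    meets the caller-specific quality threshold."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        qual = float(line.split("\t")[5])
        if qual >= threshold:
            kept.append(line)
    return kept
```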

Table S17.

F1 scores of NanoCaller1 and NanoCaller2 SNP models on ONT datasets. We recommend 162 and 78 as quality thresholds for the NanoCaller1 and NanoCaller2 models, respectively. The column 'Quality Score Range' shows the range of quality scores that give the optimal SNP F1-score shown in the 'Best F1' column. The third column under each model shows the F1-score when we use the recommended quality threshold.

Table S18.

F1 scores of NanoCaller1 and NanoCaller3 SNP models on CLR datasets. We show recommended thresholds for the higher-coverage genomes HG001/HG002 and the lower-coverage genomes HG003/HG004. The column 'Quality Score Range' shows the range of quality scores that give the optimal SNP F1-score shown in the 'Best F1' column. The third and fourth columns under each model show the F1-score when we use the recommended quality threshold.

Table S19.

F1 scores of NanoCaller ONT and PacBio indel models. We recommend 144 and 25 as quality thresholds for the ONT and PacBio models, respectively. The column 'Quality Score Range' shows the range of quality scores that give the optimal indel F1-score shown in the 'Best F1' column. The third column under each model shows the F1-score when we use the recommended quality threshold.

Table S20.

Quality thresholds used for different variant callers for different sequencing technologies. For Clair and for PacBio data we use the developer-recommended thresholds, whereas the DeepVariant developers do not recommend using any threshold. For Medaka, we calculated the average of the best quality scores for SNPs over five ONT genomes and used that average as the final SNP quality cut-off; we repeated the same procedure for indels. For WhatsHap, the best results were also obtained without any quality cut-off.

Runtime Comparisons

Table S21.

Wall-clock runtimes of various variant callers on an Intel Xeon CPU E5-2683 v4 @ 2.10GHz. For variant callers that support parallelization, we used 16 CPUs. We used the HG002 ONT 49X, PacBio CCS 35X and PacBio CLR 58X datasets for evaluation.

Selection of nearby potentially heterozygous sites for SNP calling feature generation

Table S22.

Number of potentially heterozygous sites chosen for each candidate site under two methods. Method 1 is used for SNP calling on ONT reads and selects the given number of sites from each distance range; Method 2 is used for SNP calling on PacBio reads. In each range of either method, we choose the specified number of sites closest to the candidate site, because they share more reads with the candidate site. The design of these two methods was motivated by the difference in read length distribution between ONT and PacBio reads, shown in Figure S9. An illustration of the two methods is shown in Figure S10.
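The per-range nearest-site selection shared by both methods can be sketched as follows, with hypothetical (lo, hi, k) range tuples standing in for the method-specific values in the table:

```python
def select_nearby_sites(candidate_pos, het_sites, ranges):
    """From each distance range (lo, hi) around the candidate site,
    pick the k potentially heterozygous sites closest to it, since
    nearer sites share more reads with the candidate. The range
    boundaries and k values are method-specific (see Table S22)."""
    chosen = []
    for lo, hi, k in ranges:
        in_range = [s for s in het_sites
                    if lo <= abs(s - candidate_pos) < hi]
        in_range.sort(key=lambda s: abs(s - candidate_pos))
        chosen.extend(in_range[:k])
    return sorted(chosen)
```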

Table S23.

Performance of NanoCaller SNP calling on HG004 ONT, CCS and CLR reads with various thresholds used to define heterozygous SNPs. All results show the best F1-score achieved with each threshold, evaluated on v4.2 benchmark variants.

Table S24.

Performance of NanoCaller SNP calling on HG004 ONT and CCS reads with various thresholds for the minimum number (first column) of heterozygous SNPs required for a candidate site. All results show the best F1-score achieved with each threshold, evaluated on v4.2 benchmark variants.

Table S25.

Performance of NanoCaller SNP calling, in the whole genome and various difficult genomic regions, for HG002, HG003 and HG004 ONT reads, using both Method 1 and Method 2 described in Table S22. All results show the best F1-score achieved with each method for both NanoCaller1 and NanoCaller2 SNP models, evaluated on v4.2 benchmark variants.

Table S26.

Performance of NanoCaller whole-genome SNP calling for HG002, HG003 and HG004 PacBio CCS and CLR reads, using both Method 1 and Method 2 described in Table S22. All results show the best F1-score achieved with each method with the NanoCaller1 SNP model, evaluated on v4.2 benchmark variants.

Effects of allowing WhatsHap to change genotype on SNP calling performance

Table S27.

Performance of SNP calling by NanoCaller with and without the use of the 'distrust genotypes' option for phasing by WhatsHap. All results shown elsewhere besides this table do not use the 'distrust genotypes' option, which allows WhatsHap to change genotypes. NanoCaller users can enable this option by setting the 'enable_whatshap' flag in a NanoCaller run. By default, this option is turned off in NanoCaller.

Figures

Figure S8.

Concordance of ground truth variants correctly predicted by various variant callers on Nanopore reads of the Ashkenazim trio. Venn diagrams show the overlap of v4.2 ground truth variant calls predicted correctly by NanoCaller, Medaka and Clair. a) SNPs, b) indels. All variants are inside high-confidence regions.

Figure S9.

Read length distributions of 88X ONT, 35X PacBio CCS and 27X PacBio CLR reads of HG004.

Figure S10.

Illustration of number of potentially heterozygous SNP sites chosen in each range for the two methods.

Acknowledgements

The authors would like to thank members of the Wang lab for valuable comments and feedback. We thank GIAB and the nanopore-wgs-consortium for providing the sequencing datasets and the gold-standard variant call data used in our evaluation. We thank the PrecisionFDA team for organizing the variant calling Truth Challenge on difficult-to-map regions and for scoring the submissions. This study is in part supported by NIH/NIGMS grant GM132713 to KW.

Footnotes

  • Improved the whole manuscript through clarification and additional analysis and comparison of NanoCaller against other methods.

References

  1. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010, 20:1297–1303.
  2. Garrison E, Marth G: Haplotype-based variant detection from short-read sequencing. arXiv 2012, 1207.3907.
  3. Krusche P, Trigg L, Boutros PC, Mason CE, De La Vega FM, Moore BL, Gonzalez-Porta M, Eberle MA, Tezak Z, Lababidi S, et al: Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol 2019, 37:555–560.
  4. Zook JM, McDaniel J, Olson ND, Wagner J, Parikh H, Heaton H, Irvine SA, Trigg L, Truty R, McLean CY, et al: An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol 2019, 37:561–566.
  5. Cameron DL, Di Stefano L, Papenfuss AT: Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun 2019, 10:3240.
  6. Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, et al: The potential and challenges of nanopore sequencing. Nat Biotechnol 2008, 26:1146–1153.
  7. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, et al: Real-time DNA sequencing from single polymerase molecules. Science 2009, 323:133–138.
  8. Mantere T, Kersten S, Hoischen A: Long-Read Sequencing Emerging in Medical Genetics. Front Genet 2019, 10:426.
  9. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, et al: Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 2018, 36:338–345.
  10. Shi L, Guo Y, Dong C, Huddleston J, Yang H, Han X, Fu A, Li Q, Li N, Gong S, et al: Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun 2016, 7:12065.
  11. Pendleton M, Sebra R, Pang AW, Ummat A, Franzen O, Rausch T, Stutz AM, Stedman W, Anantharaman T, Hastie A, et al: Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods 2015, 12:780–786.
  12. Seo JS, Rhie A, Kim J, Lee S, Sohn MH, Kim CU, Hastie A, Cao H, Yun JY, Kim J, et al: De novo assembly and phasing of a Korean human genome. Nature 2016, 538:243–247.
  13. Cho YS, Kim H, Kim HM, Jho S, Jun J, Lee YJ, Chae KS, Kim CG, Kim S, Eriksson A, et al: An ethnically relevant consensus Korean reference genome is a step towards personal reference genomes. Nat Commun 2016, 7:13637.
  14. Stephens Z, Wang C, Iyer RK, Kocher JP: Detection and visualization of complex structural variants from long reads. BMC Bioinformatics 2018, 19:508.
  15. Heller D, Vingron M: SVIM: structural variant identification using mapped long reads. Bioinformatics 2019, 35:2907–2915.
  16. Jiang T, Liu B, Jiang Y, Li J, Gao Y, Cui Z, Liu Y, Wang Y: Long-read-based Human Genomic Structural Variation Detection with cuteSV. bioRxiv 2019:780700.
  17. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC: Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 2018, 15:461–468.
  18. Fang L, Hu J, Wang D, Wang K: NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data. BMC Bioinformatics 2018, 19:180.
  19. Gong L, Wong CH, Cheng WC, Tjong H, Menghi F, Ngan CY, Liu ET, Wei CL: Picky comprehensively detects high-resolution structural variants in nanopore long reads. Nat Methods 2018, 15:455–460.
  20. Ameur A, Kloosterman WP, Hestand MS: Single-Molecule Sequencing: Towards Clinical Applications. Trends Biotechnol 2019, 37:72–85.
  21. Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND, et al: Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 2019, 37:1155–1162.
  22. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT, et al: A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 2018, 36:983–987.
  23. Luo R, Sedlazeck FJ, Lam TW, Schatz MC: A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nat Commun 2019, 10:998.
  24. Luo R, Wong C-L, Wong Y-S, Tang C-I, Liu C-M, Leung C-M, Lam T-W: Exploring the limit of using a deep neural network on pileup data for germline variant calling. Nature Machine Intelligence 2020, 2:220–227.
  25. Edge P, Bansal V: Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat Commun 2019, 10:4660.
  26. Edge P, Bafna V, Bansal V: HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res 2017, 27:801–812.
  27. medaka: Sequence correction provided by ONT Research [https://github.com/nanoporetech/medaka]
  28. Cleary JG, Braithwaite R, Gaastra K, Hilbush BS, Inglis S, Irvine SA, Jackson A, Littin R, Rathod M, Ware D, et al: Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines. bioRxiv 2015:023754.
  29. Olson ND, Wagner J, McDaniel J, Stephens SH, Westreich ST, Prasanna AG, Johanson E, Boja E, Maier EJ, Serang O, et al: precisionFDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions. bioRxiv 2020:2020.11.13.380741.
  30. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, et al: Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data 2016, 3:160025.
  31. Rang FJ, Kloosterman WP, de Ridder J: From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol 2018, 19:90.
  32. Zascavage RR, Thorson K, Planz JV: Nanopore sequencing: An enrichment-free alternative to mitochondrial DNA sequencing. Electrophoresis 2019, 40:272–280.
  33. Li H: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34:3094–3100.
  34. Genome in a Bottle [https://jimb.stanford.edu/giab]
  35. Liu Q, Fang L, Yu G, Wang D, Xiao CL, Wang K: Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nat Commun 2019, 10:2449.
  36. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25:2078–2079.
  37. Krizhevsky A, Sutskever I, Hinton GE: ImageNet classification with deep convolutional neural networks. Commun ACM 2017, 60:84–90.
  38. Patterson M, Marschall T, Pisanti N, van Iersel L, Stougie L, Klau GW, Schonhuth A: WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J Comput Biol 2015, 22:498–509.
  39. Glorot X, Bengio Y: Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Yee Whye T, Mike T, eds.), vol. 9, pp. 249–256. PMLR; 2010.
Posted November 30, 2020.