Symphonizing pileup and full-alignment for deep learning-based long-read variant calling

Deep learning-based variant callers are becoming the standard and have achieved superior SNP calling performance using long reads. In this paper, we present Clair3, which leverages the best of two major method categories: pileup calling handles most variant candidates with speed, and full-alignment calling tackles complicated candidates to maximize precision and recall. Clair3 ran faster than any of the other state-of-the-art variant callers and performed the best, especially at lower coverages.

The pileup algorithms are usually superior in terms of time efficiency, while the full-alignment algorithms provide the best precision and recall. More characteristics and limitations of the two designs are discussed in Clair 3 and DeepVariant 1 , respectively. However, while the two designs are not mutually exclusive, there have not been any studies combining pileup calling and full-alignment calling.
To fill the gap, we developed Clair3, the successor to Clair, which makes the best of both designs. It runs as fast as the pileup-based callers and performs as well as the full-alignment-based callers. Extended Data Figure 1 shows the workflow of Clair3. The philosophy behind Clair3 is to trust the full-alignment model unless the pileup model can make a quick but reliable decision. First, the pileup calling network goes through all the variant candidates that passed a coverage threshold and an alternative allele frequency threshold. Next, the high-quality pileup calls are used to phase the alignments and as part of the final output.
Then, the alignments phased by WhatsHap 7 are used to generate full-alignment input, which is ~23 times larger in size than the pileup input, for each low-quality pileup call for full-alignment calling. Finally, the full-alignment calls are integrated with the high-quality pileup calls as the final output. More details and parameters about the Clair3 workflow, input/output, and network architecture are provided in Methods.
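The integration strategy above can be sketched in a few lines of Python. Here `pileup_call` and `full_alignment_call` stand in for the two networks, and the dict-based call representation and the function names are illustrative assumptions, not Clair3's actual API:

```python
def two_stage_calling(candidates, pileup_call, full_alignment_call,
                      var_frac=0.30, ref_frac=0.10):
    """Illustrative two-stage integration: fast pileup calls handle most
    candidates; the lowest-quality tail is re-called with the slower
    full-alignment network, and the two result sets are merged."""
    calls = [pileup_call(c) for c in candidates]
    variants = sorted((c for c in calls if c["gt"] != "0/0"),
                      key=lambda c: c["qual"])
    refs = sorted((c for c in calls if c["gt"] == "0/0"),
                  key=lambda c: c["qual"])
    # Bottom 30% of variant calls and 10% of reference calls (the
    # Clair3 defaults) are re-called using full-alignment input.
    n_var = int(len(variants) * var_frac)
    n_ref = int(len(refs) * ref_frac)
    recalled = [full_alignment_call(c)
                for c in variants[:n_var] + refs[:n_ref]]
    return variants[n_var:] + refs[n_ref:] + recalled
```

The merge step reflects the workflow's final output: high-quality pileup calls pass through unchanged, while only the uncertain tail pays the full-alignment cost.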
We benchmarked Clair3 v0.1-r11 against PEPPER r0.8, Medaka v1.4.4 (the last version of ONT's in-house tool that supported variant calling), Longshot 8 v0.4.5 (non-deep-learning-based; supports only SNPs), and Clair v2.1.1 (the Clair3 predecessor) on two GIAB 9, 10 samples: HG003 and HG004. HG003 was tested on models (including a pileup and a full-alignment model) trained on HG001, 2, 4, and 5; HG004 was tested on models trained on HG001, 2, 3, and 5. The model availability and training details are in Methods. We primarily benchmarked ONT data base-called using Guppy 5 (version 5.0.14), but we also benchmarked Guppy 4 (version 4.2.2) data, because a variant caller's capability shows on noisier data: compared with Guppy 5, which was released in mid-2021, Guppy 4's read accuracy is ~1.8% lower 11 .

The benchmarking results at coverages from 10x to 50x of Guppy 5 data are shown in Figure 1a, Supplementary Table 2, and Supplementary Table 3. The observations of the different tools on HG003 and HG004 are almost identical, ruling out the possibility of any tool overfitting to a particular sample. In terms of the SNP F1-score, Clair3 outperformed all other tools at the more challenging lower coverages (10x to 20x). Above 20x, Clair3 performed similarly to PEPPER but ran much faster. Unlike the SNP F1-score, which plateaued above 20x, the Indel F1-score kept increasing with coverage. We also examined the precision and recall at 20x, an optimistic estimate of the minimal coverage achievable per PromethION flowcell in production, for SNPs (Figure 1b).

The success of the Clair3 method lies in the effective distinction between true and false calls during pileup calling, so that only the necessary candidates are sent to the much more computationally intensive full-alignment calling. Figure 2a shows that an effective distinction was achieved using variant quality.
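For reference, the precision, recall, and F1-score reported throughout these benchmarks follow the standard definitions computed by variant-benchmarking tools such as hap.py, from true-positive (TP), false-positive (FP), and false-negative (FN) counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard benchmarking metrics: precision = TP/(TP+FP),
    recall = TP/(TP+FN), F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```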
Using HG003 at 50x as an example, most false variant calls and false reference calls had a quality between 0 and 10, while the true calls were between 15 and 30. In reality, the correctness of a pileup call is not known in advance, so we empirically decided to send the bottom 30% of the pileup variant calls and the bottom 10% of the pileup reference calls to full-alignment calling as the default settings of Clair3 (see Methods). In the previous example, a quality cutoff of 16 was chosen for the variant calls, which resulted in 96% of the false variant calls and only 7% of the true variant calls being sent to full-alignment calling. Similarly, a cutoff of 16 was chosen for the reference calls, so that 98% of the false reference calls and only 13% of the true reference calls were sent to full-alignment calling. Figure 2b shows that ~60% of the pileup-failed variant calls and ~30% of the pileup-failed reference calls were correctly called in full-alignment calling.

We tested sending different percentages of pileup variant calls to full-alignment calling, from 0% (pileup calling only) to 100% (full-alignment calling only). The results are shown in Figure 2c and Supplementary Table 8. Clair3's default, which had a performance similar to full-alignment calling but ran ~3 times faster, showed that integrating pileup and full-alignment calling is a better strategy than relying on only one of them. To eliminate the concern of model overfitting, we benchmarked chromosome 20, which has always been excluded from model training. The results, shown in Supplementary Table 7, indicate that no overfitting was observed in Clair3.
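The percentile-based cutoff described above can be illustrated on labeled benchmark calls, where correctness is known; the dict layout and function name here are hypothetical:

```python
def cutoff_capture(calls, frac=0.30):
    """Pick the QUAL value at the `frac` percentile and report which
    share of false and true calls fall below it, i.e. would be sent
    to full-alignment calling under this cutoff."""
    quals = sorted(c["qual"] for c in calls)
    cutoff = quals[int(len(quals) * frac)]
    below = [c for c in calls if c["qual"] < cutoff]
    n_false = sum(1 for c in calls if not c["correct"])
    n_true = len(calls) - n_false
    false_share = sum(1 for c in below if not c["correct"]) / max(n_false, 1)
    true_share = sum(1 for c in below if c["correct"]) / max(n_true, 1)
    return cutoff, false_share, true_share
```

On well-separated quality distributions, a 30% cutoff captures most of the false calls while sacrificing few true ones, mirroring the 96%/7% split reported above.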

The benchmarks focused on the more challenging ONT data, but the Clair3 method is not restricted to a particular sequencing technology; it should work particularly well in terms of both runtime and performance on noisy data. Having integrated plenty of feedback from the community and ONT, Clair3 is currently in its eleventh revision. We observed that ONT has removed the variant calling submodule from Medaka and has suggested Clair3 for variant calling since v1.5.0 14 . We also observed that in PEPPER's r0.7 15 update, a module at the front of the pipeline that was used solely for variant candidate selection was repurposed to output summary-based variant calls to relieve the heavy full-alignment calling workload, which converges with Clair3's idea. We expect integrating pileup and full-alignment calling to become common practice in deep learning-based variant calling in the future.

Methods
The Clair3 workflow

As Extended Data Figure 1 shows, pileup candidates that pass a coverage threshold and an allele frequency threshold are extracted and then called using the pileup network.
The pileup calls are grouped into variant calls (genotypes 0/1, 1/1, and 1/2) and reference calls (0/0). Both groups are ranked by variant quality (QUAL). High-quality heterozygous SNP calls (the top 70% of 0/1 calls) are used in WhatsHap phasing to produce phased alignments for input to the full-alignment network. Low-quality pileup calls (by default, the lowest 30% of variant calls and 10% of reference calls) are then called again using the full-alignment network. Finally, the full-alignment calls and the high-quality pileup calls are output. Clair3 supports both VCF and GVCF output formats.
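The phasing-input selection can be sketched as follows, assuming each call is a dict with `gt`, `ref`, `alt`, and `qual` fields (an illustrative layout, not Clair3's internal representation):

```python
def phasing_snps(pileup_calls, top_frac=0.70):
    """Select high-quality heterozygous SNPs (top 70% of 0/1 calls by
    QUAL) as input variants for WhatsHap phasing."""
    het_snps = [c for c in pileup_calls
                if c["gt"] == "0/1"
                and len(c["ref"]) == 1 and len(c["alt"]) == 1]
    het_snps.sort(key=lambda c: c["qual"], reverse=True)
    return het_snps[:int(len(het_snps) * top_frac)]
```

Only heterozygous SNPs are useful here: homozygous calls carry no phase information, and indels are excluded because their representation is less reliable for phasing.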

Input/Output
Clair3 uses a pileup input design simplified from that of its predecessors, and a full-alignment input that covers as many details in the read alignments as possible. The Supplementary materials describe how the pileup input counts an indel. For example, a 3bp deletion with the most read support will have the first deleted base counted in either D1S+ or D1S-, and the second and third deleted bases counted in either DR+ or DR-. The design was determined experimentally, but the rationale is that for 1bp indels, which are easy to call, the caller looks into the differences between the "S" counts, but reduces the quality as the "R" counts and the discrepancy between positions increase.

An indel length prediction with an exact length will output the most read-supported allele at that length; otherwise, the most read-supported allele below -15bp/above 15bp is output. In training, indel length task 1 is given the smaller number, and in all our variant calling experiments, no length prediction in task 1 larger than in task 2 was observed.

The maximum supported coverage of the full-alignment input was 89. If the coverage was above 89, random subsampling of reads was applied; if the coverage was below 89, zero-padding was applied, with the reads placed at the center of the input. The maximum supported coverage of the full-alignment input can be increased by changing the "matrix_depth_dict" variable in the "param_f.py" configuration file.
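A minimal sketch of the depth normalization, assuming each read is a fixed-width feature row (the real tensor layout and channels in Clair3 differ):

```python
import random

def normalize_depth(reads, max_depth=89, seed=0):
    """Fit a per-candidate read matrix to a fixed depth: randomly
    subsample rows when coverage exceeds max_depth, otherwise zero-pad
    so the reads sit at the vertical center of the input."""
    depth, width = len(reads), len(reads[0])
    if depth > max_depth:
        keep = sorted(random.Random(seed).sample(range(depth), max_depth))
        return [reads[i] for i in keep]
    pad_top = (max_depth - depth) // 2
    pad_bottom = max_depth - depth - pad_top
    return ([[0] * width for _ in range(pad_top)] + list(reads)
            + [[0] * width for _ in range(pad_bottom)])
```

Centering the reads keeps low-coverage inputs spatially comparable across candidates, so the convolutional layers see reads in a consistent position regardless of depth.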

Network architecture
The pileup and full-alignment networks of Clair3 are shown in Supplementary Figure 3.
We tried removing a residual block from the full-alignment network; the overall F1-score dropped by ~4% on average in multiple experiments with HG003 and HG004 at coverages from 10x to 50x. Adding a residual block, the overall F1-score improvement was unnoticeable, but the network slowed down by 20%. Removing the SPP layer reduced the Indel F1-score by ~10% at 10x coverage. More results on removing the insertion, phasing, MQ, or BQ channels, or the two indel length tasks, are shown in the Supplementary materials.

The pretrained models are continually updated. While targeted for use in production, they were trained using multiple GIAB samples with known variants and 10 coverages for each sample (more details in the "Training data augmentation using subsampled coverage" section).

(1) Representation unification: among all equivalents, the candidate haplotype with the most read support is selected, and the variants in that haplotype are used as the new truth variants. This step is computationally intensive, so in practice we applied it to partitions with at most 15 candidates and required less than 100bp between the candidates. Fourth, candidates with low alternative allele frequency (≥0.08 but <0.15) and with in situ matches between the alignments and the truth variants were chosen. Fifth, the truth variants or unified variants generated in steps 2, 3, and 4 were merged. In our benchmarks, representation unification alone in general increased the SNP recall by ~0.2% and the Indel recall by ~2%.
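The partitioning constraint that keeps representation unification tractable (at most 15 candidates per partition, adjacent candidates less than 100bp apart) can be sketched as follows; the function is an illustration, not Clair3's code:

```python
def partition_candidates(positions, max_size=15, max_gap=100):
    """Greedily group sorted candidate positions: start a new partition
    whenever the gap to the previous candidate reaches max_gap or the
    current partition is already full."""
    partitions, current = [], []
    for pos in sorted(positions):
        if current and (pos - current[-1] >= max_gap
                        or len(current) >= max_size):
            partitions.append(current)
            current = []
        current.append(pos)
    if current:
        partitions.append(current)
    return partitions
```

Bounding the partition size caps the number of candidate haplotypes that must be enumerated, which is what makes the exhaustive most-supported-haplotype search affordable.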
(2) Ratio of variant to non-variant samples for training: In Clair, the ratio was fixed at 1:2. In Clair3, we tested ratios up to 1:10 for both pileup and full-alignment model training, and we observed a monotonic but decelerating performance increase as more non-variants were added to the training. Since focal loss is used to alleviate the effect of training class imbalance, another possible explanation is that the 21-genotype output task that Clair3 relies primarily on is insensitive to the ratio, because it judges only the genotype of a candidate instead of whether a candidate is a variant or not. We chose 1:5 as the default.

(3) Candidates in difficult regions: Variant calling is more difficult in the "low complexity" and "difficult to map" regions. In addition to selecting candidates for full-alignment calling by pileup calling quality ranking, we added the candidates at positions with relatively low sequence entropy (the lowest 30% of the whole genome). About three times more candidates were selected for full-alignment calling, but the performance increase was negligible.
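The low-entropy criterion can be illustrated with the Shannon entropy of a sequence window; the window size and the exact definition used in Clair3 are not specified here, so this is a generic sketch:

```python
import math
from collections import Counter

def window_entropy(seq, k=1):
    """Shannon entropy (bits) of the k-mer distribution of a window;
    repetitive, low-complexity sequence scores low."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values())
```

Ranking genome windows by this score and taking the lowest 30% would reproduce the kind of selection described above.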

Data availability
The 1) links to the reference genomes, truth variants, benchmarking materials, and ONT data, and 2) commands and parameters used in this study are available in the Supplementary Notes. All analysis output, including the VCFs and running logs, is available at http://www.bio8.cs.hku.hk/clair3/analysis_result.