Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex Lipoprotein(a) KIV-2 VNTR

Background Repetitive genome regions, such as variable number of tandem repeats (VNTR) or short tandem repeats (STR), are major constituents of the uncharted dark genome and evade conventional sequencing approaches. The protein-coding LPA kringle IV type-2 (KIV-2) VNTR (5.6 kb per unit, 1-40 units per allele) is a medically highly relevant example with a particularly intricate structure, multiple haplotypes, intragenic homologies and an intra-VNTR STR. It is the primary regulator of plasma lipoprotein(a) [Lp(a)] concentrations, an important cardiovascular risk factor. However, although Lp(a) variance is mostly genetically determined, Lp(a) concentrations vary widely between individuals and ancestries. This VNTR region hides multiple causal variants and functional haplotypes. Methods We evaluated the performance of amplicon-based nanopore sequencing with unique molecular identifiers (UMI-ONT-Seq) for SNP detection, haplotype mapping, VNTR unit consensus sequence generation and copy number estimation via coverage-corrected haplotype quantification in the KIV-2 VNTR. We used 15 human samples and low-level mixtures (0.5% to 5%) of KIV-2 plasmids as a validation set. We then applied UMI-ONT-Seq to extract KIV-2 VNTR haplotypes in 48 multi-ancestry 1000 Genomes samples and analyzed at scale a poorly characterized STR within the KIV-2 VNTR. Results UMI-ONT-Seq detected KIV-2 SNPs down to 1% variant level with high sensitivity, specificity and precision (0.977±0.018; 1.000±0.0005; 0.993±0.02) and accurately retrieved the full-length haplotype of each VNTR unit. Human variant levels were highly correlated with next-generation sequencing (R2=0.983) without bias across the whole variant level range. Six reads per UMI produced sequences of each KIV-2 unit with Q40-quality.
The KIV-2 repeat number determined by coverage-corrected unique haplotype counting was in close agreement with droplet digital PCR (ddPCR), with 70% of the samples falling even within the narrow confidence interval of ddPCR. We then analyzed 62,679 intra-KIV-2 STR sequences and identified ancestry-specific STR patterns. Finally, we characterized the KIV-2 haplotype patterns across multiple ancestries. Conclusions UMI-ONT-Seq accurately retrieves the SNP haplotype and precisely quantifies the VNTR copy number of each repeat unit of the complex KIV-2 VNTR region across multiple ancestries. This study utilizes the KIV-2 VNTR, presenting a novel and potent tool for comprehensive characterization of medically relevant complex genome regions at scale.

The LPA gene encodes the apolipoprotein(a) [apo(a)] and controls most (>90%) of the lipoprotein(a) [Lp(a)] plasma variation [16]. High Lp(a) plasma concentrations are considered a nearly monogenically determined, very frequent, causal, independent and heritable risk factor for atherosclerotic cardiovascular diseases [17][18][19] that increase cardiovascular risk up to threefold [20,21]. Elevated Lp(a) concentrations are found in ≈20% of White individuals and even in ≈50% of individuals of African ancestry [17]. Median Lp(a) levels vary tenfold between ancestries [22] and the individual plasma concentrations vary even 1000-fold [16]. The causes of this huge phenotypic variance are not fully understood but likely result from intricate, haplotype-dependent, non-linear interactions between multiple functional LPA variants and the KIV-2 VNTR size [15].
The complex structure of the LPA gene severely complicates genetic analysis [15]. It comprises ten highly homologous kringle-IV domains (KIV-1 to -10) [23,24]. Each KIV domain consists of two short exons (≈160 and 182 bp) interspaced by a mostly ≈4 kb intra-kringle intron and a 1.2 kb intron linking it to the next domain [15]. The KIV-2 domain is encoded by a polymorphic VNTR, which introduces 1 to ≈40 KIV-2 units per gene allele (5.6 kb per repeat unit) [23]. This creates an up to >200 kb VNTR region consisting of highly homologous coding repeat units that encompass up to 70 % of the protein [25]. The VNTR copy number explains 30-70 % of Lp(a) variance in a non-linear, ancestry-dependent manner [16]. Individuals carrying at least one low molecular weight (LMW) apo(a) isoform (defined as 10-22 KIV units [16], i.e. 1 to 13 KIV-2 units [15]) present 5 to 10 times higher median Lp(a) levels compared to high molecular weight isoforms (>22 KIV; HMW) due to higher protein production rates [17]. However, the individual Lp(a) levels within the same isoforms can still vary 10 to 200-fold [15] due to multiple, partially unknown genetic variants that modify the effect of the VNTR [15].
The interactions between the KIV-2 VNTR size and modifier SNPs are complex and multilayered (reviewed in detail in [15]). They are haplotype-dependent and only partially captured by linkage disequilibrium (LD) [15]. Indeed, several functional SNPs are located within the KIV-2 VNTR, including the two SNPs 4925G>A [25] and 4733G>A [26]. These two variants alone explain a remarkable 11.9% of the Lp(a) variance in the general population, are ancestry-specific, are associated with considerably reduced cardiovascular risk and were hidden in the KIV-2 VNTR until recently [25][26][27][28]. The background apo(a) isoform size and other variants on the same haplotype can both limit and amplify the effects of any functional variant [15]. Although the KIV-2 VNTR encompasses most of the LPA gene region, the full genetic and haplotypic diversity of KIV-2 units and the LD of KIV-2 haplotypes with the haplotypes in the non-repetitive kringles remain largely unknown.
Current short-read deep sequencing approaches confidently identify KIV-2 SNPs [24], but mostly lose the long-range SNP haplotype data. Early cloning studies identified three synonymous KIV-2 haplotypes named KIV-2A, KIV-2B and KIV-2C [29,30]. These KIV-2 subtypes are defined by the haplotype of three SNPs in KIV-2 exon 1 and differ by about 120 bases [23,24]. The KIV-2 subtypes have no functional relevance, but their frequencies differ widely between ancestries and correlate with known differences of the Lp(a) phenotype across ancestries [24,30]. They may thus reflect distinct evolutionary histories of the KIV-2 region and tag novel ancestry-specific functional variants. Further haplotypic effects in the KIV-2 VNTR are unknown and could not be studied so far.
Nanopore sequencing (ONT-Seq; Oxford Nanopore Technologies, ONT) provides novel means to address this knowledge gap. DNA is sequenced by monitoring the sequence-specific current fluctuations generated by single-stranded DNA molecules translocating through protein pores [31,32]. This generates read lengths a hundred times longer than short-read next-generation sequencing (NGS) [14,32] and provides single molecule resolution. It thus allows retrieving the complete haplotype of each DNA molecule sequenced, even in DNA mixtures [32]. However, at the single molecule level the benefits of nanopore sequencing are limited by its relatively high raw-read error rate (0.7-1% median error rate per read).
Especially in highly similar repeat sequences like the KIV-2, errors cannot be polished by sequencing deeper (because the parental repeat of each read is unknown [33]) or by using double-stranded ("duplex") basecalling (because of erroneous matching of strands originating from different parental molecules).
Coupling of ONT-Seq with unique molecular identifiers (UMI-ONT-Seq) allows lowering the ONT-Seq error rate considerably [33,34]. UMIs are oligonucleotide libraries that randomly tag each template molecule with a unique identifier (Figure 1). The tagged library is amplified by PCR to generate multiple copies of each UMI-tagged template molecule and full-length sequenced [34]. The reads are clustered according to terminal UMI combination, which reflects their original template, and a consensus sequence is generated for each UMI cluster. This removes PCR and sequencing errors [34] (Figure 1), while preserving the complete SNP haplotype of each input molecule. In highly repetitive and homologous regions such as the LPA KIV-2 repeats, this finally provides highly accurate consensus sequences of each repeat unit.
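The clustering-and-consensus principle can be illustrated with a toy sketch. It assumes error-free UMI pairs and reads that are already aligned to equal length, and it uses a simple per-position majority vote; the actual workflow clusters UMIs with tolerance and polishes clusters with dedicated tools. The function name `umi_consensus` and the parameter `min_reads` are hypothetical.

```python
from collections import Counter, defaultdict

def umi_consensus(reads, min_reads=3):
    """Group reads by terminal UMI pair and build a majority-vote consensus.

    reads: iterable of (umi_5prime, umi_3prime, sequence) tuples.
    Assumption: sequences are pre-aligned and of equal length.
    """
    clusters = defaultdict(list)
    for umi5, umi3, seq in reads:
        clusters[(umi5, umi3)].append(seq)

    consensus = {}
    for umi_pair, seqs in clusters.items():
        if len(seqs) < min_reads:
            continue  # cluster too small for a reliable consensus
        # Most frequent base in every alignment column
        consensus[umi_pair] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus

# Toy data: two template molecules plus one undersized cluster
reads = [
    ("AAC", "GGA", "ACGTA"),
    ("AAC", "GGA", "ACGTA"),
    ("AAC", "GGA", "ACTTA"),  # random error in one read, outvoted
    ("CCA", "AGG", "ACGCA"),  # genuine variant present in all reads
    ("CCA", "AGG", "ACGCA"),
    ("CCA", "AGG", "ACGCA"),
    ("GAG", "CAC", "ACGTA"),  # single read, discarded
]
result = umi_consensus(reads)
```

In this toy data the isolated error is removed, while the variant shared by all reads of the second cluster is retained in its consensus.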
We describe a comprehensive assessment of UMI-ONT-Seq for SNP detection, SNP haplotyping, generation of consensus sequences for each VNTR unit and copy number determination by coverage-corrected quantification of the unique haplotypes, using the LPA KIV-2 VNTR region as an example for a medically highly relevant complex VNTR region. We created a scalable, freely available UMI-ONT-Seq Nextflow analysis pipeline that can also be generalized to any other UMI-ONT-Seq experiment (https://github.com/genepi/umi-pipeline-nf) and demonstrate LPA KIV-2 haplotyping by UMI-ONT-Seq in

KIV-2 amplicons and reference sequence
All KIV-2 VNTR units were amplified using two amplicons that amplify all KIV-2 units as an overlapping amplicon mixture [24,25,30] (Figure 1A). The PCR5104 amplicon spans one KIV-2 unit and portions of the inter-kringle intron. The PCR2645 amplicon spans one inter-KIV-2 intron with the flanking exons and parts of the intra-kringle intron. Experimental conditions are given in Supplementary table 1 and Supplementary table 2. All positions and fragment lengths in this manuscript are based on the reference sequences for a single KIV-2 unit used in [24].

Human samples and KIV-2 ddPCR
Sixteen unrelated samples from the healthy-working population SAPHIR (Salzburg Atherosclerosis Prevention Program in subjects at High Individual Risk; sample codes EUR in figures and tables) [36] with KIV-2 SNP data from ultra-deep NGS from Coassin and Schönherr et al, 2019 [24] were used to evaluate UMI-ONT-Seq performance in human samples using amplicon PCR5104. The KIV-2 repeat number was quantified by ddPCR (Supplementary Methods, Supplementary table 4). The mean confidence interval (CI) width of ddPCR KIV-2 quantification was as low as 4.81±2.34 KIV-2 copies. One sample was excluded due to technical failure. Forty-eight multi-ancestry samples (Yoruba [YRI], Dai Chinese [CDX], Japanese [JPT], Punjabi [PJL]; 12 samples per group) were obtained from the Coriell 1000G samples repository. The 1000G samples are available at the Coriell Sample Repository (Camden, NJ, USA). The SAPHIR study was approved by the ethical committee of the Land Salzburg and all participants provided informed consent.

KIV-2 UMI-ONT-Seq principle
UMI-ONT-Seq follows the approach developed by Karst et al [34] using the oligo design from ONT technical note CPU_9107_v109_revC_09Oct2020 (generating 3^16 diverse oligos; Figure 1B). The UMI primers consist of a 3' locus-specific primer (LSP) [24], the UMI and a 5' universal priming site (UVP) (Figure 1B). All PCRs were performed with the ThermoFisher Platinum SuperFi II ultra-high fidelity polymerase (accuracy >100-fold over Taq [37,38]). About 2 ng gDNA (i.e. 50,000 KIV-2 template copies under assumption of maximum 80 KIV-2 genomic repeats; observation from our database with >13,000 apo(a) Western blots [39] and from [40][41][42]) were tagged with two UMI-PCR cycles (Figure 1C1) followed by New England Biolabs ExoCIP treatment. The tagged templates were then amplified in two rounds of PCR with UVP primers (early and late UVP-PCR) to produce multiple copies of each tagged molecule (Figure 1C2). A 0.9x SPRI beads purification between the two rounds of UVP-PCR was used to reduce background (Supplementary Methods). After full-length nanopore sequencing (Figure 1C3), reads are clustered based on the terminal UMI pairs and a consensus sequence is created for every UMI cluster with a defined minimal read number (UMI cluster size; Figure 1C4). Each consensus sequence represents the sequence of a single genomic KIV-2 VNTR repeat unit.

UMI-ONT-Seq Nextflow analysis pipeline
Initially, our analysis pipeline corresponded to the proof-of-principle UMI analysis pipeline published online by ONT [45] (which follows the pipeline steps of Karst et al [34]), but was migrated to the Nextflow framework with several optimizations to improve performance and parallelization. The pipeline steps are described in Supplementary figure 2 and in the Supplementary Methods. In brief, all reads of each barcode are merged, filtered for length (>1000 bp) and mean per-base quality (>Q9) and aligned to the target reference sequence. Only primary alignments with more than 95% overlap are retained to remove chimeric amplicons. The reads are clustered according to the terminal UMI sequences using vsearch [46] and only UMI clusters with ≥20 reads were retained for consensus sequence generation (as recommended by Karst et al. [34]). Each cluster is then polished using Medaka v1. Our analysis pipeline and test data are available at https://github.com/genepi/umi-pipeline-nf under GNU General Public License v3.0. For reproducibility, we also provide the migrated ONT pipeline at https://github.com/genepi/umi-pipeline-nf/tree/default_clustering_strategy.

UMI-ONT-Seq residual error rate estimation
Sequencing the unmixed plasmids provides an intuitive way to estimate the residual error rate of UMI-ONT-Seq, as any variation in the consensus sequences can be considered an error. The error rate can be quantified either by averaging the Phred (Q-) scores of each consensus sequence ("consensus sequence Q-Score") or dataset-wide ("dataset Q-Score"). The latter was introduced because the fragment length limits the maximum achievable consensus-sequence Q-Score and because perfect consensus sequences produce an infinite Q-Score. The dataset Q-Score was defined via the error rate $\text{error rate} = n_{\text{differences}} / (n_{\text{sequences}} \times \text{length}_{\text{sequences}})$, with the Q-score being $Q = -10 \times \log_{10}(\text{error rate})$ [47].
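The dataset Q-score can be sketched as follows. The sketch reads the denominator as the total number of consensus bases (number of sequences times sequence length), which is an assumption about the intended grouping of the formula; the function name and all example numbers are hypothetical.

```python
import math

def dataset_q_score(n_differences, n_sequences, sequence_length):
    """Phred-scaled dataset-wide error rate.

    Assumption: error rate = differences / (sequences * length).
    Returns infinity for an error-free dataset.
    """
    if n_differences == 0:
        return math.inf
    error_rate = n_differences / (n_sequences * sequence_length)
    return -10 * math.log10(error_rate)

# Hypothetical dataset: 5 differences in 1,000 consensus sequences of 5,000 bp
q_dataset = dataset_q_score(5, 1000, 5000)

# A single 5,000 bp consensus sequence with one error cannot score above ~Q37,
# which illustrates why the fragment length caps the per-sequence Q-score.
q_single = dataset_q_score(1, 1, 5000)
```

With these illustrative numbers the dataset reaches Q60, whereas a single sequence carrying one error is limited to about Q37.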
All reads were aligned to the reference sequence for one repeat unit and conventional low-level variant calling was performed using default Mutserve v2.0.0-rc15 [49] and ClairS-TO [50].Extraction of the intronic KIV-2 STR sequences and variants for each KIV-2 unit is described in the Supplementary methods.

Variant truth set
Ultra-deep targeted KIV-2 NGS data obtained previously for all SAPHIR samples [24,25] was used as truth set for UMI-ONT-Seq evaluation. The 1000G variant truth set was generated using a KIV-2 NGS variant calling pipeline on the 1000G 30X whole-genome (WGS) sequencing data [51] (with minor adaptations for WGS data). All UMI-ONT-Seq datasets were benchmarked in terms of sensitivity (true positive rate), specificity (true negative rate), precision (positive predictive value, i.e. the proportion of genuine variants among all variants found) and F1 score (harmonic mean of sensitivity and precision). For the polymorphic intronic short tandem repeat (STR) in PCR5104 (position 2472-2506) no NGS reference data was available and it thus was analyzed separately.
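The four benchmark measures follow directly from the confusion-matrix counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
import math

def variant_calling_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision and F1 from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Undefined ratios are reported as NaN instead of raising an error
        return numerator / denominator if denominator else math.nan

    sensitivity = ratio(tp, tp + fn)   # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)     # positive predictive value
    # Harmonic mean of sensitivity and precision
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for one sample
metrics = variant_calling_metrics(tp=90, fp=10, tn=4990, fn=10)
```

Because most positions of a repeat unit carry no variant, true negatives dominate the counts. Specificity therefore stays close to 1 even when precision drops noticeably, which is why precision and F1 are the more discriminating measures here.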

KIV-2 units haplotype extraction
Haplotypes were extracted using a three-step algorithm (available at https://github.com/AmstlerStephan/haplotyping-KIV2-nf). 1) Extraction of uniquely occurring haplotypes including all positions that were called as variants in any of the samples (unique haplotypes). 2) Noise polishing and removal of "unlikely" haplotypes (merged haplotypes). 3) Estimating the repeat number per haplotype by coverage correction (coverage-corrected haplotypes).
The unique haplotypes were determined by extracting from the consensus sequences of each sample the base present at any polymorphic position in the dataset, as found by mutserve variant calling. At positions where the per-sample variant frequency was <0.85 %, only the major variant was used in the haplotype.
Next, only the uniquely occurring haplotypes per sample were retained, including their occurrence in the consensus sequences (unique haplotypes).
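The extraction of unique haplotypes can be sketched as below. The sketch assumes consensus sequences that share the reference coordinates (0-based positions, no indels) and interprets the per-sample variant frequency as the combined frequency of all non-major bases at a position; both are simplifying assumptions, and the function name is hypothetical.

```python
from collections import Counter

def unique_haplotypes(consensus_seqs, polymorphic_positions, min_variant_freq=0.0085):
    """Count the unique haplotypes over the polymorphic positions of one sample."""
    n = len(consensus_seqs)
    major_base = {}
    use_major_only = {}
    for pos in polymorphic_positions:
        counts = Counter(seq[pos] for seq in consensus_seqs)
        base, base_count = counts.most_common(1)[0]
        major_base[pos] = base
        # Below the cut-off, the position is treated as noise in this sample
        use_major_only[pos] = (1 - base_count / n) < min_variant_freq

    haplotypes = Counter()
    for seq in consensus_seqs:
        hap = "".join(
            major_base[pos] if use_major_only[pos] else seq[pos]
            for pos in polymorphic_positions
        )
        haplotypes[hap] += 1
    return haplotypes

# Toy sample: 200 consensus sequences, polymorphic positions 1 and 3
seqs = ["ACGT"] * 120 + ["ATGT"] * 79 + ["ACGA"]
haps = unique_haplotypes(seqs, [1, 3])
```

In the toy sample, the single deviating base at position 3 (0.5 %) falls below the cut-off and is replaced by the major base, leaving two unique haplotypes with their occurrence counts.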
To obtain the merged haplotypes, the residual noise was reduced by clustering identical haplotypes and assigning each haplotype cluster below the threshold $n_{\text{sequences}} \times 0.0085$ to the haplotype cluster with the smallest edit distance, but not more than a maximum edit distance of 1. Next, assuming unbiased and random tagging of KIV-2 repeats in our samples and unbiased PCR amplification, we applied a binomial distribution to determine the minimal occurrence required for a haplotype to be considered genuine. The binomial distribution formula in this context is expressed as: $P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k}$. Here, n represents the total number of haplotypes observed after the UMI sequencing, p is the probability of a haplotype to occur and k the minimal occurrence threshold for a haplotype to be considered genuine. The probability is set to 0 (stringent filtering) to calculate the sample-specific minimal occurrence threshold (k) for any given haplotype. Haplotypes falling below the determined minimal occurrence threshold k have 0 probability to be genuine (being e.g. generated by the residual noise in the UMI sequencing) and were therefore excluded from the analysis.
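The noise-polishing merge step (not the binomial filter) can be sketched as follows. Several details are assumptions made for illustration only: haplotypes are equal-length strings of bases at the polymorphic positions, so the edit distance reduces to the number of mismatches; only haplotypes at or above the threshold serve as merge targets; ties are resolved in favour of the more abundant target; and rare haplotypes without a neighbour within the maximum distance are left unchanged for the subsequent occurrence filter. The function names are hypothetical.

```python
def mismatches(a, b):
    """Number of differing positions between two equal-length haplotype strings."""
    return sum(x != y for x, y in zip(a, b))

def merge_rare_haplotypes(hap_counts, min_fraction=0.0085, max_distance=1):
    """Assign rare haplotypes to their closest abundant neighbour."""
    n_sequences = sum(hap_counts.values())
    threshold = n_sequences * min_fraction
    merged = {h: c for h, c in hap_counts.items() if c >= threshold}
    targets = list(merged)

    for hap, count in hap_counts.items():
        if count >= threshold:
            continue
        # Candidate targets sorted by distance, then by abundance (assumption)
        near = [(mismatches(hap, t), -hap_counts[t], t) for t in targets]
        near = [candidate for candidate in near if candidate[0] <= max_distance]
        if near:
            merged[min(near)[2]] += count
        else:
            merged[hap] = count  # kept for the later occurrence filter
    return merged

# Toy sample with 1,000 consensus sequences (threshold: 8.5 occurrences)
counts = {"CT": 600, "TT": 395, "CA": 3, "GA": 2}
merged = merge_rare_haplotypes(counts)
```

Here the rare haplotype "CA" lies one mismatch from "CT" and is merged into it, whereas "GA" differs from both abundant haplotypes at two positions and is not merged.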
For all remaining haplotypes, the number of occurrences is adjusted for potential identical KIV-2 repeats by dividing the occurrence of each haplotype by the median occurrence (sample-wise) and rounding that to the next integer. This gives the coverage-corrected number of KIV-2 repeats (coverage-corrected haplotypes).
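The coverage correction amounts to a small calculation. In this sketch, "rounding to the next integer" is interpreted as rounding to the nearest integer, and every retained haplotype is counted at least once; both are assumptions, as are the function name and the example counts.

```python
import math
from statistics import median

def coverage_corrected_repeats(hap_counts):
    """Estimate how many genomic repeats share each haplotype.

    Each occurrence is divided by the sample-wise median occurrence and
    rounded to the nearest integer (assumption), with a minimum of 1.
    """
    med = median(hap_counts.values())
    return {
        hap: max(1, math.floor(count / med + 0.5))
        for hap, count in hap_counts.items()
    }

# Hypothetical haplotype occurrences in one sample (median = 105)
occurrences = {"h1": 100, "h2": 95, "h3": 210, "h4": 105, "h5": 310}
repeats = coverage_corrected_repeats(occurrences)
total_repeats = sum(repeats.values())
```

In this example, the haplotypes observed about twice and three times as often as the median are counted as two and three identical repeats, giving eight KIV-2 repeats in total from five unique haplotypes.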

Results
We performed a comprehensive evaluation of the performance of UMI-ONT-Seq for variant calling, haplotyping, generation of consensus sequences for each VNTR unit and VNTR copy number determination in the complex LPA KIV-2 VNTR [15].We generated 28 sequencing libraries encompassing both PCR products of two unmixed plasmid standards differing by 87 (PCR5104) and 120 (PCR2645) bases, five plasmid mixtures (ddPCR-validated, Supplementary table 3) and two sequencing chemistries (R9, V14).
Moreover, the KIV-2-spanning PCR5104 amplicon was sequenced on 15 human validation gDNA samples and finally used to call mutations, map haplotypes, and quantify the genomic KIV-2 units in 48 1000G samples from 4 different populations.

UMI-ONT-Seq recapitulates expected mutation levels and KIV-2B haplotype fractions in plasmid mixtures
The switch from R9 to V14 chemistry, as well as basecalling V14 data with SUP or SUPDUP instead of HAC had, as expected, a large impact on sequencing accuracy (Supplementary  3A). Notably, no performance difference was seen for V14 data between HAC and the computationally much more expensive SUP basecalling. Specificity was 99.9 to 100% in all samples (Supplementary table 10), but a residual background noise at 0.2% to 0.5% variant level was observed in all plasmid mixtures and sequencing chemistries (Figure 2B, Supplementary figure 3B). Therefore, we introduced a cut-off of 0.85% for all further experiments, which includes all bona fide KIV-2 variants while allowing some technical variation.
Since the variants present on the two plasmids represent distinct haplotypes, we expected all mutations from the same plasmid to occur at the same level.We found that all mutations from the same plasmid were, indeed, close to the expected level on average, but showed considerable per-position noise when analyzed with the ONT UMI analysis pipeline using the default clustering strategy (Figure 2B).
Systematic analysis of the UMI clusters revealed that inaccurate clustering of the UMI sequences by vsearch resulted in partially heterogeneous UMI clusters (Figure 2C, Supplementary table 11). If the UMI clusters contained a mixture of sequences (e.g. KIV-2A and KIV-2B sequences), the cluster-polishing step produced noisy mutation levels (Figure 2B, Supplementary table 12).
Implementation of the cluster splitting strategy reduced the edit distance in the UMI clusters considerably (Figure 2D).This had no impact on variant calling performance in the V14 chemistry (Figure 2E, Supplementary table 13) and most importantly, mutations originating from the same plasmid now showed virtually no residual variation and matched the expected values very accurately (Figure 2F).The number of plasmid standards with no variant level noise increased from 4/14 to 12/14 samples in HAC and SUP.The median variant level noise was reduced by 3.1-fold and 2.3-fold for V14 HAC and SUP (R9 HAC: 0.8-fold, Supplementary table 14).This indicates that our cluster splitting strategy allows accurate recalling of the haplotype of each read (Supplementary table 15).All further results are based on the UMI-ONT-Seq analysis pipeline using the cluster splitting strategy.

UMI-ONT-Seq produces highly accurate KIV-2 consensus sequences at ≥6 reads per UMI cluster
We investigated the relationship between the UMI cluster size and consensus sequence qualities using the unmixed plasmids (KIV-2A and KIV-2B), where it can be assumed that any difference between the consensus sequences represents a PCR or sequencing error.
Up to 10 reads per UMI cluster, the dataset Q-score increased rapidly in the V14 data, reaching Q40 already at 6-10 reads per cluster (Figure 3A, Supplementary table 16; 14 reads for R9 data, Supplementary figure 4A).The consensus sequence Q-score increased in a similar manner as the dataset Q-score, reaching the maximal quality after 6 reads per cluster for the V14 chemistry (14 for R9 HAC; Supplementary figure 5A).At 6 reads per cluster already 96.8 % (HAC) and 98.3 % (SUP) of all consensus sequences showed no more than 2 errors, and 58.1 % (HAC) and 62.5 % (SUP) were even error-free (Figure 3B; R9: 79.6 % and 33.5 %; Supplementary table 17 and Supplementary figure 5B).

Accurate variant calling of variants within the KIV-2 VNTR in human samples
We evaluated the performance of UMI-ONT-Seq on the KIV-2 PCR5104 fragment, which encompasses about 92% of the KIV-2 VNTR region, in 15 human validation samples with KIV-2 batch NGS sequencing data available [24].Sensitivity, specificity, precision and F1 score were mostly very close or equal to 100% (Figure 4A; see Supplementary table 20 for single sample values and Supplementary figure 6 for R9 HAC).
Most importantly, specificity (mean ± SD) was 1.0±0.001 for all conditions, demonstrating a very low false-positive rate of UMI-ONT-Seq despite its relatively high raw-read error rate (Supplementary table 21). Also in human samples V14 data performed generally better than R9, while V14 SUP provided marginal benefit over V14 HAC (Figure 4A). Importantly, despite providing considerably higher raw read accuracy (median read quality ≈Q28; Supplementary table 5), UMI-ONT-Seq with duplex basecalling (SUPDUP) leads to very low sensitivity and F1 score (0.481±0.298; 0.578±0.221; Figure 4B; Supplementary table 21). This was actually expected due to the specific technical implementation of duplex basecalling and is addressed in the Discussion section.
To evaluate the advantage of the UMI-ONT-Seq we called the KIV-2 variants on the same ONT-Seq data without UMI clustering (simulating a conventional KIV-2 deep sequencing approach [24]). The performance of the variant calling without UMIs increased continuously from R9 HAC to V14 HAC to V14 SUP to V14 SUPDUP (Figure 4C, Supplementary table 22), but precision and F1 score were considerably below UMI-ONT-Seq in all conditions. For V14 SUPDUP without UMIs, sensitivity and specificity reached 0.950±0.031 and 0.971±0.008, but precision and F1 score were only 0.399±0.122 and 0.554±0.123. UMI-ONT-Seq also considerably exceeded the performance of the nanopore-specific low-level variant caller ClairS-TO (Supplementary figure 8).
Importantly, UMI-ONT-Seq not only discriminated variants in the KIV-2 very efficiently, but also recapitulated very closely the individual variant levels measured previously by deep NGS [24] (R2: 0.977 and 0.981 for V14 HAC and SUP, Figure 4D, Supplementary figure 7). With V14 conditions, no bias was observed across the whole mutation level range (Figure 4E; Supplementary figure 9 for R9). Conversely, both the R9 chemistry and especially the UMI-free conditions showed noticeable bias and less correlation, which was exacerbated by a high number of false-negatives in the UMI-free analysis (Supplementary figure

Panel A: Performance of UMI-ONT-Seq compared to KIV-2 specific variant calling of high-coverage whole genome sequencing (WGS; ADD VNTR pipeline) for the 48 1000G samples. While sensitivity and specificity are close to perfect (mean ± SD: 0.993±0.01, 0.996±0.005), precision and, consequently, the F1 score deviate from the WGS data. All mutations that were additionally found by UMI-ONT-Seq were previously classified as KIV-2B specific, intronic variants, which are reported to be difficult to call in WGS data [24]. Panel B: Correlation of UMI-ONT-Seq variant levels versus WGS variant levels (n = 5968 variants). The variant levels found by both methods are highly correlated (r: 0.992, R2: 0.983). Deviations were observed only for KIV-2B specific variants. Panel C: Correlation of ddPCR measured versus UMI-ONT-Seq predicted number of KIV-2 repeats. We observed a high correlation between the UMI-ONT-Seq quantified and the true number of KIV-2 repeats (r: 0.851, R2: 0.724). Panel D: Comparison of UMI-ONT-Seq with ddPCR quantification. UMI-ONT-Seq accurately predicts the number of KIV-2 repeats. The mean (± SD) deviation between UMI-ONT-Seq and ddPCR was only -0.295 ± 4.26 repeats. For 32 of 48 samples, the UMI-ONT-Seq estimate was even within the 95 % confidence interval of ddPCR.
7.0. The UMI extraction, clustering and reference alignment steps are repeated for the polished consensus sequences to generate the final consensus sequences (clustering step 2). Extensive analysis of the migrated pipeline revealed inaccurate clustering by vsearch (see Results section for details). In both clustering steps, vsearch clustered distant UMI combinations and separated identical UMI combinations into different clusters. We therefore modified the clustering strategy of the pipeline. Looser clustering parameters (80 % sequence identity) in clustering step 1 prevent separation of identical UMI sequences. To prevent mixing of distant UMIs into one cluster, all clusters containing more than 12 (R9 HAC), 10 (V14 HAC), 8 (V14 SUP) reads were then split by taking the first UMI sequence of the cluster file and clustering it with all remaining UMI sequences in the same file that show ≤2 bp edit distance (UMI collision probability <0.01 %). The remaining UMIs were treated as a new cluster and clustered iteratively. The minimal UMI cluster size required for consensus creation was derived based on plateauing of the consensus quality at these values (see Results). Subsequently, in clustering step 2 stringent clustering parameters (>99 % sequence identity) were used to remove PCR duplicates without mixing distinct UMI clusters.
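The iterative cluster splitting described above can be sketched as a greedy procedure. The sketch uses a plain Levenshtein edit distance written in pure Python and hypothetical function names; the pipeline's actual implementation may differ in detail.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]

def split_cluster(umis, max_distance=2):
    """Split one UMI cluster into sub-clusters of closely related UMIs.

    The first UMI seeds a sub-cluster that collects every UMI within
    max_distance; the remaining UMIs are processed in the same way.
    """
    remaining = list(umis)
    subclusters = []
    while remaining:
        seed = remaining[0]
        members, rest = [], []
        for umi in remaining:
            if edit_distance(seed, umi) <= max_distance:
                members.append(umi)
            else:
                rest.append(umi)
        subclusters.append(members)
        remaining = rest
    return subclusters

# Toy cluster in which two unrelated UMI families were mixed
mixed = ["AAAACCCC", "AAAACCCT", "GGGGTTTT", "AAATCCCC", "GGGGTTTA"]
subclusters = split_cluster(mixed)
```

Because the seed is always the first remaining UMI, the result depends on the input order. A greedy split of this kind is simple and fast, but it is not guaranteed to find the globally best partition.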

Figures

Figure 1: Technical aspects: LPA structure, amplicon location and UMI design. Panel A: Partial LPA gene structure (second exon of KIV-4, KIV-5 to KIV-10 and the C-terminal protease domain omitted) and amplicon location. Panel B: UMI-ONT-Seq primer design, including a universal primer (UVP) binding site for amplification of the tagged KIV-2 repeats, the UMI sequence and a locus-specific primer sequence (LSP) (e.g. KIV-2 specific). The UMI sequence leads to about 43 million possible unique tagging sequences. Panel C: The LSP primer site is used to tag specifically each KIV-2 repeat in a sample with a unique UMI sequence (1). Subsequent amplification with a universal primer (UVP) (2) and sequencing (3) causes random errors that cannot be differentiated from genuine low-level variants (red boxes in 3). The UMI sequences are used to cluster all sequences originating from one KIV-2 repeat unit and create molecule-wise consensus sequences. This removes errors that occur only in a subset of sequences in each UMI cluster, but retains genuine variants that occur in a majority of the sequences within each UMI cluster.

Figure 2: Variant detection in plasmid mixtures with the ONT UMI analysis pipeline using the default clustering strategy and the cluster splitting strategy for the V14 chemistry and HAC basecalling. Panels A to C show the results for the default clustering strategy, panels D-F show the corresponding results for cluster-splitting strategy. Panels A and E: Performance measures for the default clustering strategy in recalling variants of the two unmixed plasmids and plasmid mixtures from 5 to 0.5 % KIV-2B in KIV-2A background of two fragments (PCR2645, PCR5104). Panels B and F: variant levels for the plasmid mixtures from 0.5 -5 % across every position of both fragments. Grey points: Low-level residual noise. Variation of detected variant levels of up to ±1 % in absolute values was observed (blue points, 5 % mixture). Panels C and D: Edit distance of the UMI-sequences for different UMI cluster sizes. Grey: single clusters. Red: Average cluster size. Using the default clustering strategy admixing of UMI-sequences up to an edit distance of 25 within one cluster (i.e. the sequences did not originate from the same KIV-2 repeat) led to considerable variance in the observed variant levels (panel C). Using the cluster splitting strategy with maximal edit distance 2 between the UMI-sequences (D) reduced the variant level noise considerably (E, colored points) for both fragments and all mixture levels.

Figure 3: Impact of the minimal cluster size on consensus sequence quality and error profile. Panel A: Dependency of the dataset Q-score on the minimal cluster size. The V14 HAC and V14 SUP Q-scores increase rapidly, being close to Q40 already at cluster size 6 to 10. Panel B: Percentage of perfect reads depending on the minimal cluster size threshold. At Q40 dataset Q-score, 62 % of all consensus sequences are error-free and 98.5 % of all clusters had no more than 2 errors for both V14 conditions. Panel C: Error type frequency for V14 HAC and SUP. C to A and G to T transversions were the most common errors. ":N" denotes insertions; D denotes deletions. Panel D: Dataset Q-score at different cluster thresholds after filtering for the sequencing chemistry-specific errors. The dataset Q-score of V14 consensus sequences reaches Q50 already at a cluster size of 6 to 10.

Figure 4: Performance measures compared to ultra-deep NGS sequencing of 15 human DNA samples (SAPHIR) for the V14 chemistry (HAC and SUP basecalling). Panel A: Performance of UMI-ONT-Seq for the 15 human DNA samples. Black points are median values. Colored points are the single samples. We observed high agreement between the ultra-deep NGS sequencing and UMI-ONT-Seq for most samples, leading to median performance values above 95 % and slightly higher performance values for SUP basecalling. Panel B: Performance values for duplex basecalling (SUPDUP). Black points and lines are median values and interquartile range. Grey points are the single samples. Despite increased raw read quality there was a significant drop in sensitivity when using SUPDUP basecalling (see Discussion for explanation). Panel C: Performance values for ONT-Seq (without UMIs) for different chemistries (R9, V14) and basecalling algorithms (HAC, SUP, SUPDUP) (black points and lines are median values and interquartile ranges; colored points are the single samples). Sensitivity increased with increasing raw read quality up to median values of 95 % for SUPDUP basecalling, but precision and F1 score were consistently low due to a high number of false positives. Panel D: Correlation of variant levels for each mutation of all 15 DNA samples (black points) of UMI-ONT-Seq for V14 HAC and SUP basecalling compared to ultra-deep NGS sequencing. We observed nearly 100 % correlation (r and R2 > 0.977) for both conditions, with no bias across the variant level range (E).

Figure 5: Correlation between the expected and UMI-ONT-Seq predicted number of KIV-2 repeat units. Panel A: Correlation of the coverage-corrected UMI-ONT-Seq haplotypes with the number of KIV-2 repeats expected from ddPCR. Coverage-corrected UMI-ONT-Seq haplotypes allow precise determination of the true number of KIV-2 repeats. Panel B: The barplots report the predicted versus measured number of KIV-2 repeats per sample including the confidence interval of the ddPCR quantification. 75%-80% of the samples are within the 95 % confidence interval of the ddPCR. EUR: European samples from Austria (SAPHIR study).

Figure 6: UMI-ONT-Seq analysis of 48 samples from four populations of the 1000 Genomes Project.

Figure 7: KIV-2 subtype specific haplotype diversity of 12 representative human gDNA samples in the present study (1000G, SAPHIR). Splitting the samples by the KIV-2 subtypes they contain revealed 2 major clusters containing either only KIV-2A (panels A-D, KIV-2A repeat haplotypes marked in blue) or the phylogenetically distant KIV-2B (panels E-H, KIV-2B repeat haplotypes marked in green) and C (panels I-L, KIV-2C repeat haplotypes marked in red) subtypes. While the KIV-2B and C clusters contain very similar sequences with low diversity, the KIV-2A cluster has large internal differentiation (see Supplementary figure 14 for all samples).

table 5). For technical performance values across all experiments see Supplementary table 6 to Supplementary table 9. Variant calling performance of UMI-ONT-Seq was excellent in all plasmid mixtures down to 1% for both amplicons (Figure 2A, for R9 data see Supplementary figure