ABSTRACT
Oxford Nanopore Technologies’ MinION has expanded the current DNA sequencing toolkit by delivering long read lengths and extreme portability. The MinION has the potential to enable expedited point-of-care human leukocyte antigen (HLA) typing, an assay routinely used to assess the immunological compatibility between organ donors and recipients, but the platform’s high error rate makes it challenging to type alleles with clinical-grade accuracy. Here, we developed and validated Athlon, an algorithm that iteratively scores nanopore reads mapped to a hierarchical database of HLA alleles to arrive at a consensus diploid genotype; Athlon achieved a 100% accuracy in class I HLA typing at high resolution.
MAIN TEXT
The Oxford Nanopore Technologies’ (ONT) MinION is a portable device the size of a mobile phone that performs rapid single-molecule sequencing.1 This device directly records dynamic changes in electric current across a nanopore as a single-stranded DNA or RNA molecule is ratcheted through the pore by a motor protein. The raw signals from hundreds of working nanopores are converted to sequencing reads in real time via an online base caller. The portability of this system, combined with its superior read lengths of up to 50 kb2, make the MinION uniquely positioned to enable point-of-care clinical sequencing, especially for applications that require haplotype information.
Advances in flow cell design and base-calling algorithms have led to steady improvements in the MinION’s raw read accuracy, which was as low as 66% in early 20143 and is now approximately 92%4. However, despite initial successes in diagnostic microbiology,5-10 the relatively high error rate and the lack of a dedicated variant caller for diploid genomes have prevented the MinION from achieving widespread use in human DNA sequencing.3, 11 Here, we report a method for the targeted nanopore sequencing of class I HLA genes and a bioinformatic pipeline to interpret these data (Athlon; http://github.com/cliu32/Athlon). Together, these provide a robust, cost-effective approach for the typing of class I HLA alleles in human samples with clinical-grade accuracy.
There are thousands of unique HLA proteins expressed in the human population. These diverse proteins present antigens on the cell surface for immune recognition and constitute the major barrier to allogeneic transplantation. HLA typing is critical for the evaluation of immunological compatibility between organ donor and recipient pairs. Although rapid, high-resolution HLA typing would be ideal for organ allocation12, 13, this has not been possible due to the technical limitations inherent to both Sanger and second-generation sequencing platforms.
The sequences of all known HLA alleles are deposited in the IPD-IMGT/HLA database14. Each HLA allele is named by locus (e.g. HLA-A) followed by an asterisk and up to four, colon-delimited numeric fields (Fig. 1a). The first field groups together HLA alleles that encode antigens sharing key serological epitopes. The first and second fields describe groups of alleles that encode the same unique protein. If synonymous mutations are present in any exons, a third field is appended to the allele name, and a fourth field can be added to describe sequence variation in non-coding regions. This comprehensive nomenclature system also allows new alleles to be named and organized hierarchically as they are discovered over time.
The antigen recognition domains (ARD) of class I HLA genes are encoded by exons 2 and 3, the most diverse regions of these genes. These two exons are the most informative for matching donor and recipient, so for hematopoietic stem cell transplantation, HLAs are typed and matched at least at these exons. This is referred to as the G-group level typing.15 For solid organ transplantation, HLAs are routinely typed at 1-field resolution, but typing at a higher resolution such as the 3-field, G-group level will provide added benefit.12 Therefore, to evaluate the performance of MinION-based HLA typing, we defined success as correctly identifying the one (for homozygous samples) or two (for heterozygous samples) alleles at the 3-field, G-group level that best matched the sequencing data.
We generated three datasets for this study (Online Methods and Supplementary Table 1), by amplifying class I HLA genes using long-range PCR with or without sample barcodes, followed by ligation of sequencing adaptors and multiplexed MinION sequencing. The consensus sequences from the template and complement strands (2D reads) from each run were demultiplexed based on sample barcodes or primer sequences before analysis (Online Methods and Supplementary Fig. 1). Average read lengths consistent with the predicted amplicon sizes ranging from 3.0 to 4.3 kb were achieved across these datasets (Supplementary Table 2). We used one of the datasets that was generated with the earlier R7.3 flow cells (WASHU-T) to develop and train the Athlon pipeline. The two remaining datasets, WASHU-V and DKMS, which were generated with the improved MK1 R7.3 and R9.4 flow cells respectively, were used to evaluate the performance of Athlon.
Since HLA alleles are extremely diverse, mapping reads to any single reference sequence will result in the loss of relevant reads that differ significantly from the chosen reference. To circumvent this problem, Athlon performs two rounds of read mapping to identify candidate alleles and then build consensus sequences from these alleles. First, Athlon maps locus-specific reads to all HLA alleles at the 3-field, G-group resolution (Online Methods) using BLASR, an aligner originally designed to map long and error-prone reads from the PacBio platform.16 One or two candidate alleles are then identified based on coverage statistics and an algorithm outlined below (see Fig. 1a-c). Second, all locus-specific reads are realigned to the candidate alleles to generate one or two consensus sequences, which are then queried against all reference alleles in the database to identify the best match as the final typing result (Fig. 1d).
To identify the candidate alleles used to generate the final consensus sequences, Athlon represents each HLA locus by a tree structure, treating the typing fields as branching nodes and leaves in order to identify alleles with the highest read coverage (Fig. 1a). For example, the HLA-A locus has 3311 leaves, 2546 nodes, and 21 nodes at the 3-, 2-, and 1-field typing levels respectively (Fig. 1a). Read coverage at the leaves under each node is summed to provide a total coverage value for each 2-field node and then for each 1-field node. The nodes and leaves are then sorted at each level by total coverage in descending order (Fig. 1b). Next, candidate alleles are identified by first selecting the top-ranked nodes at the 1-field level and then selecting the highest-ranked 2-field nodes that are connected to these 1-field nodes. The highest-ranked 3-field leaves that are connected to the selected 2-field nodes are then chosen as the candidate alleles (Fig. 1b). For the last step of Athlon, all reads are remapped to the selected candidate alleles, and the final consensus sequences are blasted against the reference database to identify the closest allele(s) as the final typing result (Fig. 1d).
To differentiate homozygous versus heterozygous genotypes, Athlon uses cutoffs based on normalized coverage to determine whether to consider a second allele at the one-field and two-field levels. The optimal cutoff values were empirically determined using the WASHU-T training dataset (n=15), which included three homozygous and 27 heterozygous samples. We quantified the coverage of 2nd-ranked 1-field and 2-field nodes in the three homozygous samples, and the coverage data were normalized using values from the highest ranked nodes as denominators (Online Methods and Supplementary Fig. 2). The mean plus two standard deviations of the normalized coverage of 2nd-ranked nodes was approximately 15% and 40% of the top-ranked allele, which were used as the statistical thresholds for calling a second node (i.e. a heterozygous typing call) at the 1-field and 2-field resolutions respectively (Fig. 1c). With these values for coverage thresholds, we were able to successful classify all homozygous and heterozygous samples in the training dataset. No threshold was applied at the 3-field level because alleles that are heterozygous beyond the first two fields encode the same protein.
Although the training dataset was collected with earlier versions of R7.3 chemistry, which is significantly more error prone than the later R7.3 and current R9.4 chemistry, we achieved typing results that were 96.7%, 83.3%, and 70% concordant with the truth at 1-, 2-, and 3-field resolutions respectively (Fig. 2a). We next used Athlon to type another low accuracy dataset previously published by Ammar et al., which also used the earlier R7.3 chemistry (n=2, a total of 4 alleles typed at 2-field resolution). The Athlon pipeline correctly typed three alleles at the 2-field level (a 75% accuracy), while the original analysis based on the GATK HLACaller was completely discordant at the same resolution (Fig. 2b), demonstrating that Athlon significantly improves HLA typing accuracy over the only existing method.
We next sought to evaluate the performance of Athlon using samples sequenced with updated R7.3 and R9.4 flow cells (Supplementary Table 1). We first analyzed the WASHU-V dataset (n=9, including four homozygous samples). For these samples, Athlon was 100% accurate at all three resolutions (Fig. 2c). To validate these results on a larger number of samples, we analyzed the DKMS dataset (n=30), which was collected from a single multiplexed MinION run. Again, the final allele calls were 100% concordant with the ground truth at 1-, 2-and 3-field resolutions (Fig. 2d). Additionally, the number of mismatches between the final predicted consensus sequence(s) and the closest reference allele(s) was quite low for the samples in these two high-quality datasets relative to the earlier, error prone, WASHU-T dataset (Fig. 2e). These results demonstrate that the Athlon pipeline performs HLA typing with clinical-grade accuracy when samples are sequenced with updated R7.3 and R9.4 flow cells.
To determine the maximum number of samples that can be multiplexed on a flow cell, we down-sampled the number of MINION reads used for our analysis and investigated the impact on Athlon’s accuracy. We explored a range of 15-800 reads per sample for the WASHU-V dataset and a range of 15-400 reads per sample for the DKMS dataset. For the WASHU-V dataset, Athlon was 100% accurate at the 1-field level throughout the range of down-sampling. At the 2- and 3-field resolutions, Athlon was 100% concordant if 400 or more reads were input per sample (Fig. 2f). For the DKMS dataset, Athlon was 100% correct at the 1-field level with as few as 25 reads per sample, and only 100 reads per sample were required for a 100% concordance at the 2- and 3-field resolutions (Fig. 2g). These results suggest that up to 600 loci can be typed on a single flow cell, assuming that least 60K total reads can be obtained from each sequencing run (Supplementary Table). Extrapolating from these results, we estimate that more than 100 individuals can be typed for class I HLA genes using a single MinION flow cell. Furthermore, when fewer reads are used per sample, significant reductions in computation time can be achieved (Fig.2f,g, and Online Methods). Taken together, these results suggest that rapid, point-of-care clinical sequencing can be performed cost-effectively through sample multiplexing.
In summary, we have developed and validated a MinION-based method that allows for the accurate typing of highly polymorphic class I HLA genes. One-hundred percent accuracy was obtained for 3-field G-group level typing, a resolution that is suitable for clinical applications. This approach paves the way for point-of-care HLA typing, which would greatly expedite organ allocation since it could be initiated at the bedside of deceased donors, or in an outreach laboratory. Additionally, the library preparation for the MinION does not require the fragmentation of long amplicons, which is regularly applied to HLA typing with second-generation sequencing platforms. Bypassing this step simplifies the workflow, reduces bias,17 and results in a more uniform read coverage (Supplementary Fig. 3). Moreover, nanopore sequencing can be readily scaled up to much larger numbers of pores18 and does not require a significant capital investment for equipment. In conclusion, MinION-based sequencing has the potential to spark a paradigm shift in HLA typing in contemporary clinical and research laboratories.
METHODS
Methods, including statements of data availability and any associated references, are available in the online version of the paper.
AUTHOR CONTRIBUTIONS
Contribution: C.L. designed and conducted experiments, constructed the bioinformatic pipeline, analyzed and interpreted the data, and wrote the paper; F.X. performed data analysis and interpretation, worked on an alternative method; J.H. conducted experiments; K.L. conducted experiments and analyzed data; P.Q. analyzed data; B.D. contributed to specimen collection and experiments; R.M. supervised the overall project, designed experiments, analyzed the data, and wrote the paper.
COMPETING FINANCIAL INTERESTS
C.L. and R.D.M were participants of the MinION Access Program and received the initial MinION instrument and flowcells free of charge.
ONLINE METHODS
Datasets and study design
Three datasets, WASHU-T (for Washington University-Training), WASHU-V (for validation) and DKMS (for German Bone Marrow Donor Center), were generated and analyzed in this study (Table S1). The WASHU-T dataset used five genomic DNA specimens, four from the 1000 Genomes projects and one from an deidentified homozygous donor, to prepare five multiplexed libraries. Each library included fragmented amplicons from three HLA class I genes (see below) and was sequenced by the R7.3 flow cell chemistry in one run. The WASHU-T dataset was used to develop and train the Athlon pipeline.
The WASHU-V dataset used three genomic DNA specimens, two from the 1000 Genomes projects and one from a deidentified homozygous donor, to generate three libraries. Each multiplexed fragmented amplicons from three HLA class I genes and was sequenced by the updated MK1 R7.3 flow cell chemistry in one run. An external validation was performed using the DKMS dataset, which was generated by multiplexed sequencing of 30 barcoded locus-specific samples in one run using the R9.4 flow cell chemistry. Detailed information including software and protocol versions are provided in Supplementary Table 1. All specimens in the WASHU-T, -V and DKMS datasets were typed by one or more reference methods, including Sanger and Illumina sequencing. Moreover, to benchmark against a method reported previously by Ammar et al.,19 the dataset from that report, which included two locus-specific samples (Supplementary Table 1), were analyzed using the Athlon pipeline. The typing results were compared with the truth and typing results reported from the original study. Only 2D reads were analyzed for all datasets due to their lower error rate.
Target amplification
Three class I HLA genes, HLA-A, -B, and -C, were amplified separately in full length by long-range PCR. Primer sequences are listed in Supplementary Table 1. The primers and PCR conditions used for the WASHU datasets were reported previously.20 Long range PCR of the DKMS samples was performed in 96 well plates with 4 µl template DNA, 12.5 µl 2x GoTaq® Long PCR Master Mix (Promega, Madison, USA) and 1 µl of a target-specific primer mix (50 µM each) in a total volume of 25 µl. A thermal profile of 95°C for 3 min followed by 25 cycles at 95°C for 15 s, 62°C for 30 s and 68°C for 7.5 min, and a finishing step at 68°C for 15 min was used. A subsequent PCR was applied for barcoding where a 5 specific adaptor sequence of the target-specific primers was used as template for the barcode introducing primers. This PCR was performed in 96 well plates where 2 µl of the target specific PCR (PCR1) was mixed with 12.5 µl 2x GoTaq® Long PCR Master Mix (Promega, Madison, USA), 8.5 µl of Nuclease-Free Water (Promega, Madison, USA) and 1 µl of an index primer mix (20 µM each); a thermal profile of 95°C for 3 min followed by 7 cycles at 95°C for 15 s, 55°C for 30 s and 68°C for 7.5 min, and a finishing step at 68°C for 15 min was used. Target-specific and indexing primers were designed in-house. All primers were obtained from metabion (metabion international AG, Planegg, Germany).
Library preparation and nanopore sequencing
For the WASHU-T and -V datasets, PCR amplicons of HLA-A, -B, and -C from the same genomic DNA source were pooled and purified with a 1.0x reaction with Agencourt Ampure XP Beads (Fisher Scientific NC9959336). DNA was then treated with PreCR Repair Mix (NEB M0309) to repair any damaged template DNA. Following a second round of 1x ampure clean-up, libraries were prepared with the standard protocol supplied by ONT using the version MN004 or MN006 library preparation kits. Libraries #1 through #5 were sequenced for 48 hours on the R7.3 flow cells on the original MinION devices (WASHU-T dataset). Libraries #6, #7, and #3 (repeat) were sequenced for 12 hours on the MAP103 (R7.3) flow cells on the newer MK1 MinIONs (WASHU-V dataset). Sequencing and base calling were performed using the MinKnow and Metrichor software available at the time (Supplementary Table 1), and the native FAST5 files were converted to the fastq format using Poretools.21
For the DKMS dataset all barcoded amplicons were pooled at 10 µl each and purified initally using SPRIselect Beads (Beckham Coulter, Brea, USA) in a ratio of 0.7:1 beads to PCR product. 1.5 µg of the purified pool was used for ONT library preparation with the ligation based 2D Library Preparation Kit (SQK-LSK208) in combination with an R9.4 SpotON Flowcell (FLO-MIN106). Library Preparation took place according to the manufacturer’s protocol and sequencing was done for 48h on a MK1B MinION. Sequencing and base calling were performed using MinKNOW version 1.1.21 and Metrichor version 1.125. The native Fast5 files were converted to FASTQ files using Poretools.21
Demultiplexing
The WASHU-T, -V and Ammar datasets were demultiplexed based on the locus-specific primer sequences. The python script is provided in the supplementary package online (http://github.com/cliu32/Athlon). The DKMS dataset was demultiplexed using lastlopper.pl (https://github.com/gringer/bioinfscripts/blob/master/lastlopper.pl), a wrapper script for the LAST aligner (http://last.cbrc.jp/)
HLA nomenclature and result evaluation
HLA alleles are named by the gene name (e.g. HLA-A) followed by an asterisk and up to four digital fields separated by colons. One-field typing corresponds to a group of alleles carrying one or more shared serological antigens. HLA alleles encoding the same unique protein share the same 2-field typing. If synonymous mutations are present in the exons, a third field is appended to the shared 2-field typing to form unique 3-field typings (Figure 1A). For alleles sharing the same 3-field typing but with sequence variations in the non-coding regions, a fourth field is added to form unique 4-field typings. The G-group code was used to describe alleles with identical sequences across exon(s) encoding the peptide binding domains (exons 2 and 3 for class I and exon 2 for class II alleles). To represent a group of such alleles, a capitalized "G" is suffixed to the allele designation of the lowest numbered allele to form the group name, e.g. A*01:01:01G.
For solid organ transplantation, HLAs are largely typed at 1-field resolution as a standard of care, although typing at a higher resolution may have added benefit.12 For hematopoietic stem cell transplantation, HLAs are at least typed at the G-group level of resolution.15 With these considerations, we aimed to type HLAs at up to the 3-field resolution using the G-group nomenclature. The reference sequences were constructed accordingly to serve this goal (see below). For the WASHU-T, -V, and DKMS datasets, all samples were typed at 3- to 4-field levels by reference methods, the accuracy of nanopore typing results were evaluated by the concordance rates at 1-field, 2-field, and 3-field G-group levels respectively. For the Ammar dataset, because the samples were typed at 2-field level by the reference method, the accuracy of nanopore typing results were evaluated by concordance rate among all alleles at 1- and 2-field levels respectively.
Reference sequences
Reference sequences for this study were constructed based on the IPD-IMGT/HLA database release 3.26.0.14 The .dat file was parsed using Biopython, and sequences for exons 2 and 3 were joint by a large intronic gap filled by "-" as space holders. The final reference file included 3311, 4163, and 3034 sequences for the HLA-A, -B and -C loci, respectively. A separate set of reference sequences without the intronic gap were used for the second read mapping to candidate alleles followed by consensus generation and blast.
Main procedures and components of the Athlon pipeline
The rationale and workflow of the pipeline are detailed in the Results section. A script for the pipeline, reference files, and sample data are provided at http://github.com/cliu32/Athlon. All read mapping were performed by BLASR (version 2.0.0)16 with the “hitPolicy” set as “randombest”. Read coverage was quantified using Bedtools (version 2.25.0) 22 with the default setting. Samtools (samtools and bcftools, version 1.2 using htslib 1.2.1)23 was used with default setting to manipulate bam files and generate consensus sequences. Final allele call was made using Blast (version 2.2.31)24 with the default setting to identify the closest reference allele to a consensus sequence.
Downsampling experiment
The WASHU-V and DKMS datasets were downsampled to the number of reads per sample as indicated using the seqtk package version 1.0-r31 (Li, https://github.com/lh3/seqtk) followed by analysis using the Athlon pipeline.
Calculation of quality metrics
The coverage depth at each base position was obtained using Bedtools (Version 2.25.0) for every candidate allele in the DKMS dataset, which was downsampled to 400 reads per sample. The coverage was visualized for each locus. The uniformity of coverage was represented by the coefficient of variance (CV) calculated as the standard deviation of coverage depth at each position across exons 2 and 3 divided by the mean coverage. The allelic balance was calculated as the total coverage of the minor candidate allele divided by that of the major candidate allele in each sample (n=30).
Data availability
Raw nanopore reads are available from the authors upon request.
ACKNOWLEDGMENTS
We thank Vineeth Surendranath and Vinzenz Lange for their helpful discussion and manuscript suggestions. C. Liu is support by the Washington University Hematology Scholars K12 award (K12-HL087107-07 to C.L.). R. Mitra is supported by grants from the NIH (U01MH109133, R01NS076993) and Children’s Discovery Institute (MC-II-2016-533).