Abstract
Accurate typing of human leukocyte antigen (HLA), a histocompatibility test, is important because HLA genes play various roles in immune responses, and have also been shown to be associated with many diseases such as cancer. The current gold standard for HLA typing uses DNA sequencing technology combined with sequence enrichment techniques using specially designed primers or probes, causing it to be slow and labor-intensive. Although there exist enrichment-free computational methods that use various types of sequencing data, hyper-polymorphism found in HLA region of the human genome makes it challenging to type HLA genes with high accuracy from whole genome sequencing data. Furthermore, these methods are database-matching approaches where their output is inherently limited by the completeness of already known types, forcing them to find the best matching known alleles from a database, thereby causing them to be unsuitable for discovery of rare or novel alleles. In order to ensure both high accuracy as well as the ability to type novel alleles, we have developed a graph-guided assembly technique for classical HLA genes, which is capable of assembling phased, full-length haplotype sequences of typing exons given high-coverage (>30-fold) whole genome sequencing data. Our method delivers highly accurate HLA typing, comparable to the current state-of-the-art database-matching methods. We also demonstrate that our method can type novel alleles by experimenting on various data including simulated, Illumina Platinum Genomes, and 1000 Genomes data.
Introduction
Human leukocyte antigen (HLA) genes are crucial to the regulation of immune system as they encode for the major histocompatibility complex (MHC) consisting of cell surface proteins that control the adaptive immune response. HLA genes are also known to play important roles in transplant rejection, autoimmune disorders [1] and cancer [2, 3]. For these reasons, accurate HLA typing is important both in clinical and research settings. HLA typing is considered challenging because of the hyper-polymorphic nature of the HLA region in human genome. Such high polymorphism in the HLA region is thought to be maintained by strong balancing selection promoting genetic diversity [4, 5]. Especially with personal genome sequencing becoming widely common, better computational methods are needed to provide rapid and inexpensive typing with high accuracy.
Traditionally, HLA typing or categorization was done by more laborious serology-based methods to screen for HLA antibodies in a donor/receiver pair. With the birth of DNA sequencing and polymerase chain reaction (PCR), molecular typing assays such as specific oligonucleotide probe hybridization (SSOP), sequencespecific primer amplification (SSP), and sequence-based typing (SBT) have been developed [6]. The SBT method can be either used with Sanger sequencing or NGS techniques. By using specific primers to perform target enrichment prior to sequencing, SBT delivers accurate and reliable typing of HLA alleles. However, all of the above molecular typing assays require a specially designed set of probes/primers, and they are labor intensive, low throughput, and costly.
With an increasing availability of personal WGS services, an availability of accurate computational HLA typing methods that do not require additional experiments can be valuable. Challenges in computational HLA typing are mainly driven by the high level of polymorphism found in the HLA region in the human genome. There are over 30 genes that are maintained in the IPD-IMGT/HLA database [7] and there are 6-8 classical HLA genes (HLA-A, -B, -C, -DQA, -DQB, -DRB) routinely used for HLA typing in clinical settings. More than 15,000 known alleles (just for these classical genes) have been reported in the database and the number of alleles is growing rapidly (Figure 1). Also, the known alleles share high sequence similarities, where many alleles just differ by a base-pair substitution. Thus, it is challenging to correctly pinpoint an individual’s HLA types among the known alleles using WGS data [8].
Previously developed enrichment-free computational methods can use whole genome sequencing (WGS), whole exome sequencing (WES), or transcriptome sequencing (RNA-seq) without the use of HLA-enriched data as opposed to the SBT. However, they do not provide typing accuracy comparable to what the SBT provides [9].
These methods either use one or both of two major techniques—alignment and assembly—to accurately compare reads to correct HLA genes and infer allele types. Alignment-based methods try to correctly assign NGS reads to HLA loci then infer typing alleles using widely used probabilistic genotyping techniques. Assembly-based methods first construct longer-than-read contigs using de novo assembly techniques and search for best matches among known alleles and direct assembly of haplotypes by traditionally available de novo or reference-based assemblers are severely confounded by the high level of polymorphisms. PHLAT [10] uses probabilistic SNP calling techniques and also models phasing by looking at reads covering neighboring variant sites. HLA-VBSeq [11] uses variational bayesian inference to correctly assign reads to alleles based on read-to-allele alignment and outputs a sorted list of alleles by the number of reads assigned. HLA*PRG [12] performs an alignment of extracted reads likely coming from HLA region to population reference graphs [13] that encode all known alleles and outputs the most likely alleles from the database.
One common ground for the enrichment-free computational HLA typing methods is that they are all driven by the finding-the-nearest-match paradigm; their goal is to find the best matching alleles to HLA genes of a test individual in a preexisting database of known entries. Given sequencing data of an individual, such a typing scheme outputs the best matching alleles for each HLA gene. This typing strategy is limited by the completeness of the collection of known alleles as it cannot detect novel alleles missing in the database of known types. We collectively refer to these approaches as database-matching methods. Novel alleles can possibly have protein coding changes which may have more profound impact in the context of organ transplant and disease association. One might argue that there are already many known alleles and that the chance of finding novel alleles is low. However, the number of known alleles in the IPD-IMGT/HLA database is still increasing rapidly (Figure 1). There is continuing effort among immunogenetics communities to study rare and novel alleles. For example, the International HLA and Immunogenetics Workshop has been organizing projects to investigate and collect rare and novel alleles since their 15th workshop in 2002. Immunogenetics-related journals such as International Journal of Immunogenetics, and HLA (formerly known as Tissue Antigens) both have a dedicated section where new alleles are announced in every issue.
For these reasons, it is important to be able to recover HLA sequences at 1-bp resolution to enable novel allele discovery as similarly done in SBT. In order to achieve this goal, we present a graph-guided assembly technique called Kourami that constructs full sequences for the peptide binding domain (exons 2 and 3 for class I and exon 2 for class II HLA genes–regions typed by the SBT methods) by using a modified partial ordered graph (POG) [14] as a guide. Our method is the first method that directly assemble both haplotypes of HLA genes rather than to infer the best matching alleles in the database. For known alleles, we show that Kourami can correctly type with high accuracy (>98%), equalling that of the state-of-the-art database matching method, across various WGS datasets such as simulated data, Illumina Platinum Genomes, and high coverage WGS from 1000 Genomes project. At the same time, Kourami only takes a fraction of time compared to other available methods with a moderate use of memory.
In addition, Kourami is the first HLA typer to be able to assemble novel alleles that do not appear in the database. It does this by treating the HLA typing problem as an instance of graph-guided assembly, where the known alleles are combined into a graph that is used to guide the assembly of new alleles. Kourami therefore also represents an early example of how a population of reference sequences can be used during genome assembly. We systematically show that Kourami is very accurate in its ability to construct novel alleles by performing leave-one-out experiments where a known allele is artificially removed from the allele database. Kourami is able to reconstruct 98% of these alleles perfectly.
Materials and Methods
HLA typing nomenclature
The current HLA allele nomenclature [15] uses a hierarchical numbering system with 4 major levels of hierarchies. From the highest to the lowest category, it annotates allele groups (2-digit resolution), protein sequence (4-digit resolution), exon sequence (6-digit resolution), and intron sequence (8-digit resolution). For example, if two alleles encode an identical protein, they will have the same numbers for the first 2 levels (4-digit) of hierarchies. In practice, HLA typing is often carried out at either the protein or exon level. Furthermore, the current gold standard, SBT, types just the exons that are responsible for encoding the peptide binding domain (exons 2 and 3 for class I genes and exon 2 for class II genes). Using only the subset of exons creates ambiguous alleles where two or more alleles share identical sequence over these exons but differ in other regions. These ambiguous sequences are grouped as a 6-digit ‘G’ allele. Similarly, 4-digit ‘P’ grouping is used for the alleles that share same amino acid sequence over these exons. Our method provides fully assembled sequence covering these exons and also outputs 6-digit ‘G’ resolution typing result by selecting known alleles that have the smallest edit distance to the assembled sequences. Similar to many other HLA tools, we focus on the routinely typed classical genes (HLA-A, -B, -C, -DQA1,-DQB1, -DRB1).
Overview of method
Our method takes an advantage of partial order graphs to capture all known alleles and further modifies the graph to include variants found in the sequencing data for graph to include paths of true alleles. An overview of our method is illustrated in Figure 2, and the major steps are labeled from (a) to (e). More details are given in “Materials and Methods”. We first create a comprehensive reference panel from a combined multiple sequence alignment (MSA) of both full-length and exon-only known alleles for each HLA locus (step a). Reads mapped to all known HLA loci in the human genome reference (GRCh38) are extracted (step b) and aligned to the comprehensive reference panel (step c). Gene-wise partial-ordered graphs are constructed using the combined MSAs and the alignments of the extracted reads are projected onto the graphs so that each read alignment is stored as a path in the graphs and read depths on edges naturally become edge weights (step d). When these reador read-pair-backed paths connect 2 or more neighboring heterozygous sites of 2 alleles, they provide phasing information. During the alignment projection, the graphs are modified by adding nodes and edges to incorporate differences found by alignment such as substitutions and indels. Note that a sequence of an allele is encoded as a path through the entire graph. Finally, with the weighted graphs with alignment paths, we formulate the problem of constructing the best pair of HLA allele sequences as finding the pair of paths through the graph. When finding the pair, we consider consistent phasing information from the reads and coverage with a use of base quality scores. Additionally, the pair of paths may be identical to permit homozygous alleles.
Input alignment and extraction of HLA reads
Kourami takes alignment of WGS to the human genome as an input in the BAM format. For many experiments used here, we used pre-computed alignments downloaded from Eurepean Bioinformatics Institute and Google Cloud Platform. In case of missing alignment files, we follow the 1000 Genomes procedures (see the GRCh38DH alignment readme file available from the 1000 Genome FTP server) to align reads using BWA-kit v0.7.15 [16] and further process the bam files using other tools such as BioBamBam [17] and GATK [18].
From the alignments, we extract paired-end reads aligned to all known HLA loci in chromosome 6, alternate sequences (ALT) of extended MHC (xMHC) regions, and HLA sequences (the complete set of coordinates used is in Table S1 in Supplementary Materials) included in the Human reference genome (hs38DH packaged in BWA-kit v0.7.15). In the GRCh38 assembly, regions that exhibit sufficient variability are represented in the primary chromosomal sequence as well as the ALT loci scaffolds.
Known HLA alleles and construction of a comprehensive reference panel
The Immuno Polymorphism Database (IPD) maintains a periodically updated database of known HLA alleles in the IPD-IMGT/HLA database [7]. IPD-IMGT/HLA Release 3.24.0 (April, 2016) was used for all experiments here. A detailed breakdown of numbers of alleles included in this release is shown on Table 1. The other methods compared here use earlier versions of the database because the content of database is built into their software and there is no way to update or swap database at the user level. Using a later version of the database does not give advantages as long as the earlier version also contains the true alleles of testing individual.
Many alleles in the database only contain partial sequences, often just covering few exons responsible for peptide binding domain of HLA genes (Table 1). For this reason, the IPD provides a set of pre-computed multiple sequence alignments (MSA) of full length alleles (Mgene) and just the coding regions (Mcoding) separately for each HLA gene. Similarly to HLA*PRG [12], for each HLA gene, we combine these two MSAs by aligning them at corresponding columns in order to obtain a comprehensive reference panel of known alleles. This can help better recruit reads that span intron-exon junctions. The combined MSA (Mpanel) has the same number of rows as Mcoding. The number of columns in Mpanel is equal to the sum of the number of columns in Mcoding and the number of intronic columns in Mgene. For each row in Mcoding, if the allele for the row has a corresponding row in Mgene, intronic columns are inserted into Mcoding, otherwise, intronic columns of the reference allele in Mgene are inserted. In addition to the HLA genes that are included in the IPD-IMGT/HLA database, non-polymorphic HLA genes DQA2 and DQB2, paralogous copies of DQA1 and DQB1 and often regarded as poorly polymorphic, are added to the reference panel as decoys to filter out reads originating from them aligning incorrectly to other class II genes. In our analysis, we noticed that reads coming from DQA2 or DQB2 can make the assembly of typing exons of class II genes difficult as previously reported [10].
HLA-graph construction
In order to capture all information contained in Mpanel in a minimal manner as well as to allow flexibility to enable novel sequence discovery, we use partial order graphs, a compact graphical representation for MSA [14]. From each Mpanel, we can directly construct a gene-specific partial order graph similar to ones typically used in multiple sequence alignment [14, 19]. An example of a MSA of 3 known sequences (Mpanel) is shown in Figure 3(a). Each sequence is first drawn as a chain of vertices connected by directed edges (Figure 3(b)), where each vertex vi represents a base symbol bvi (bvi ∈ {A, C, G, T, N, –}) and is positioned at column i in the graph. For each column, vertices with an identical base symbol at a column are merged as a single vertex and duplicate edges are removed (Figure 3(C),(d)). The gap symbol (‘-’) is used to restrict edges to connect vertices only from consecutive columns in the input MSA. An edge between two vertices (evi, vi+1) exists if Mpanel has a row with consecutive bases bvi and bvi+1 at columns i and i + 1. It is important to note that this graph contains at least the same number of paths as the number of rows in Mpanel used to construct the graph. The graph often encodes a larger number of paths and such flexibility of the graph is the foundation which allows us to model this family of seqeunces and capture novel alleles. For example, a simple graph shown in Figure 3(d) encodes all sequences in the given MSA as well as AGGT-A, ACGTCA, and ACCTCA. Each path through the constructed graph encodes a possible allele.
Modification of the HLA-graph via alignment projection
Consider an example novel allele sequence of AGCTCA. It is easy to see that there is no path encoding such allele in the HLA-graph shown in Figure 3(d). In this example, simply adding an edge from the vertex G at column 2 to the vertex C at column 3 is the only modification needed for the graph to include the path that encodes the novel allele. If a novel allele exists in data, there must be sequencing read that contain the differences the nov.el allele has compared to known alleles. Assuming the sequence divergence is small enough for pairwise alignment of the read and a known allele to capture the differences, we can obtain the novel variants. For this reason, we further modify the HLA-graph to include additional paths that encode for novel alleles in a test individual. We achieve this by modifying the previously constructed HLA-graph by projecting the alignments of the reads likely coming from HLA region to known HLA genes.
We first align the extracted reads to the set of reference panel sequences obtained from Mpanel using BWA (v0.7.15-r1140) [16]. The linear alignments obtained are then projected onto gene-specific partial order graphs. That is if a read is mapped to the HLA-A gene, then the alignment is projected onto the HLA-graph of the gene. Given a read r, a subsequence h of a known allele H, and a pairwise alignment of r and h, by projection of the alignment to the HLA-graph, our goals are to (1) modify the graph to encode the exact sequence of r within the range of columns h is encoded in the graph, (2) increment the weight of each edge of the path by 1, and (3) preserve preexisting paths at the same time. When r and h are identical, the graph must already contain a path that exactly encodes r because H is in the MSA used to contstruct the graph. When there are few differences such as mismatch, deletion, or insertion identified by the pairwise alignment of r and h, there are two cases: (1) r is already encoded in the graph and (2) r is not, thereby requiring modification of graph to encode r. For example, ACGTCA does not align perfectly to any of the sequences in Figure 4(a) but it is encoded in the graph as a path. On the other hand, there is no path encoding ACCTGA.
Examples of graph modification by alignment projection are shown in Figure 4. The panel (a) shows a MSA with 3 known alleles and the corresponding POG. Modification for mismatches and deletions are simple because they only require adding a vertex for the mismatched base or a gap (‘-’) symbol. Figure 4(b) illustrates an example where r has a deletion of ‘T’ at position 4. A gap vertex is added to the corresponding column and edges are added to connect the newly added vertex to the previous and next base in r to obtain a path encoding r. Normally, insertion requires a shifting of columns in the MSA/graph because extra columns are required for the inserted bases to be encoded. However, some alignments with insertions do not require a column shifting. An example of an alignment with insertion not requiring a column shifting is shown in Figure 4(c). The read is aligned to H3 with insertion at position 5 instead of aligning to an allele H1 with a mismatch at the same position because the alignment score with 1 insertion is higher than the score with 3 mismatches (position 2, 3, and 5 if aligned to H1). Because of H1 is in the MSA, the graph already has the column for handling an insertion at this particular column. In this case, we simply insert a vertex with ‘G’ symbol into the corresponding column and connect edges to complete the path for r.
Finally, the case of an insertion requiring a shift of columns is depicted as an example in Figure 4(d). The read is aligned to allele H1 with an insertion of ‘A’ at position 4. To insert a new column between 3rd and 4th columns (also denote them as left and right columns), we first insert a new vertex with a ‘-’ symbol and need to reroute all edges between the left and right columns through the newly inserted gap symbol and redistribute edge weights in order to preserve the preexisting paths. Adjusted weights are shown on the edges in the example. To describe formally, let L and R be sets of vertices of the left and right columns respectively and E be the set of directed edges from vl ∈ L to vr ∈ R with the weight of each edge as w({vl, vr}). Additionally, let be the set of all outgoing edges of vl and be the set of all incoming edges of vr. Note that there are always 1 or more outgoing edges from vl and 1 or more incoming edges to vr. After disconnecting all {vl, vr} ∈ E, we make a new vertex vgap with ‘-’ symbol and add an edge {vl, vgap} for each vl and assign a weight of . Similarly, we add an edge {vgap, vr} for each vr with a weight of . Once the column shifting is done, we can actually process the insertion base exactly same as we handled the case of inserting into a gap column.
Finding paths through the HLA-graph
Given the HLA-graph with weights, assembling HLA alleles can be formulated as the problem of finding two (diploid) paths (they can be identical) that explain the read mapping data (weights and phasing) best. For example, the read depth value for an edge can be thought of as a capacity of the edge in classical flow problems. When considering only the weights, we can find two paths where the sum of the flow values of the paths are maximized. However, this formulation does not handle phasing information embedded by reads or read pairs, therefore it can possibly select erroneous paths that are not consistent with phasing information. For this reason, we want our objective to take both weights as well as phasing information into account. Since read information is embedded on the HLA-graph, we can check if two neighboring variant sites can be phased directly by a read or read pair. For example, given two heterozygous sites with A/T and G/C, a read or a read pair carrying ‘A’ followed by ‘G’ at these sites indicates the chromosomal phase of ‘AG’ since the sequencing read is assumed to come from a contiguous segment in a chromosome. In our method, we first investigate variant regions individually to select locally phased paths with strong read support and construct a set of full-length paths through the HLA-graph by connecting the locally phased paths that can be further phased by read or read pair. Each of these full-length paths is considered as a candidate allele and the best pair among the candidates with maximum read and phasing support is selected as the output. To only consider nonzero-weight full-length paths, we remove all zero-weight edges and disconnected vertices prior to finding paths.
HLA-graph to bubble graph
We first focus on the parts of the HLA-graph where variations are captured, which are often referred to as bubbles in sequence assembly graphs [20, 21, 22, 23]. In the HLA-graph, we define a bubble as a region (3 or more consecutive columns) where there is only one vertex each in the leftmost and rightmost columns and the rest of the columns must have 2 or more vertices. Let s and t be the vertices in the leftmost and the rightmost columns respectively. Any vertex in the bubble is reachable from s and one or more paths exists from any vertex in the bubble to t. Any two distinct paths that goes through a bubble must go through s and t. Bubbles naturally capture varying sites between two alleles in the graph. The regions that are enclosed by dotted line in Figure 5(a) are examples of bubbles. On the other hand, a region that is completely shared by all paths through the HLA-graph represents a conserved region. Without the loss of generality, the HLAgraph can then be thought of as a chain of bubbles, where two neighboring bubbles are connected by a linear path of length 0 or longer (Figure 5). For simplicity, we can connect the bubbles without the linear paths as they do not play any role in determining the phase of a haplotype. We call this a bubble graph and bubbles can easily be recognized in the HLA-graph because of its structure.
Finding the best set of paths in a bubble
Ideally, we want to find exactly 2 paths per bubble since the ploidy number is 2 for humans. However, bubbles may contain more than 2 paths because of sequencing errors or misalignment. Therefore, we first identify all paths that are phased by a read or read pair. For each bubble, we can use a modified breadth first search (BFS) technique to obtain all paths that go through the bubble. In order to avoid enumerating over all paths through a bubble, we prune any path without a read backing the sequence encoded by the path at each iteration of BFS. For a path in the bubble to be retained, it must be supported by at least one read phasing the entire path. We can simply compute the set of phased reads for a path by taking a series of intersections of read sets maintained by each edge in the path. Each phased path through a bubble is a called a bubble path. Given multiple bubble paths from a bubble, our goal is to select the best pair of paths. We iterate over all possible pairs of bubble paths to calculate the posterior probability of each pair given all reads aligned to the bubble to find the pair that gives the maximum probability. We write the posterior probability of a given genotype as where Gb is a genotype and D is the alignments of all reads aligned over the bubble. The genotype is a pair of bubble paths Gb = (Hb1, Hb2). Each d ∈ D is an alignment string of a segment of a read and di is the i-th symbol in segment d. Similarly, is the i-th symbol in Hb1. P(D) is constant and we assume that the prior probability P(Gb = (Hb1, Hb2)), is uniformly distributed over all genotypes. We can then compute the conditional probability P(D│Gb) by adopting widely used formulations [18, 24] with small variations to allow multiple positions and base ‘N’ case that can be present from sequence data. We iterate over each read and compute P(D│Gb) as a product of the conditional probability of each read d. Since a read must come from one of the two chromosomes, and we assume that d is equally likely to come from either one of them, we can rewrite it as a sum of average of two cases where d is from Hb1 and Hb2 and it is .To compute the conditional probability of each d given a bubble path Hb, we iterate over each pair of corresponding positions di and jointly, assuming each di is conditionally independent of each other given . Therefore, the probability of each base di given a pair of corresponding genotype bases and is . The probability of seeing a base given an allele is defined as where є of base symbol di is the error probability obtained from the phred score of the base. For the case of di=‘N’, we simply estimate the probability as 1/4. Instead of selecting Hb from all possible │d│-mers, we limit to only the bubble paths found in the bubble and iterate over all pairs to select a pair of bubble paths Pb that jointly gives the maximum probability:
Phasing paths
We now have an ordered list of bubbles, and a list of “best” read-backed phased bubble paths for each bubble. The goal here is to find a set of candidate paths through all the bubbles by merging one bubble at a time iteratively from left to right, connecting bubble paths that are phased by a read or read pair. Two paths are said to be phase-consistent if there is a read or read pair spanning both paths. This can be checked easily by taking an intersection since each bubble path maintains a set of phasing reads. Given a set of already merged bubble paths Ƥm from the first i − 1 bubbles and a set of bubble paths Ƥbi from the i-th bubble, we look at all pairs of paths Ƥm × Ƥbi and keep only pairs that are phase-consistent and connecting each of such pairs as one path. We also update the phasing-read set for each merged path.
Selecting the best pair of candidate alleles
Once the assembly by bubble merging is finished, we have a set of merged bubble paths through all bubbles. By placing back the linear chains that were ignored during bubble merging to original positions (in between bubbles), we have a full-length candidate allele Hi for each merged bubble path. Let C be the set of all candidate alleles and B be a set of all bubbles. Our goal is to select a pair of alleles (H1, H2) ∈ C × C that has the most consistent phasing support over all bubbles. We first define a scoring metric that checks for strength of phasing support jointly for a pair of allele H1 and H2 between a pair of consecutive bubbles bi and bi+1 and it is defined as where fi(H)is the inter-bubble phasing fraction. The fraction fi(H) is the ratio of the number of phasing reads for allele H between bi and bi+1 and the number of total phasing reads. When considering two paths at the same time, there can be regions where the paths overlap (homozygous: shown as purple edges in Figure 2(e)) and separate (heterozygous: shown as blue or red edges in Figure 2(e)). For homozygous sections of the paths, that is H1 = H2, fi(H) is halved to keep balance between calling homozygous and heterozygous alleles. We can calculate a product of Fi over all pairs of neighboring bubbles to check the consistency of phasing support for the pair. Finally, we select the pair of alleles Ƥ that maximizes the product over all pairs of alleles:
Description of data used for evaluation
Simulated data
For each of the 6 HLA genes (HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, and HLA-DRB1) tested, we randomly selected 2 full-length alleles from the IPD-IMGT/HLA database (v3.24.0) and repeated this for 100 replicates, resulting in a total of 200 alleles to simulate. The exact number of full-length alleles in the database is reported in Table 1. For each replicate, we simulated 25X coverage of paired-end WGS data for each allele, making it 50X coverage for each locus. For the simulation of paired-end reads, we used an Illumina read simulator, pIRS [25], which simulates using empirical base-calling and GC%-depth profiles trained from Illumina re-sequencing of known samples. We used 2 × 100bp for the read length and 500 +/− 50bp for the mean and the standard deviation of the insert size.
Illumina Platinum Genomes
Illumina has sequenced 17 individuals (CEPH/Utah pedigree 1463) in a three generation family using their high-coverage PCR-free paired-end WGS assay (2 × 101bp). These genomes are often referred to as the Illumina Platinum Genomes [26]. The family pedigree is shown in Figure 6. Many individuals in this family are extensively investigated by the genomics community especially NA12891-NA12892-NA12878 trio as well as NA12898-NA12890-NA12877 trio. The read alignments to the GRCh37 version of the Human genome for all 17 individuals were downloaded from the Illumina Platinum Genomes page hosted on Google Cloud Platform (Table S2 in Supplementary Materials) and they were realigned to GRCh38 version of the Human genome.
1000 Genomes
The 1000 Genomes Project [27] has produced various personal genomic data. Among these, there are 11 individuals whose high-coverage WGS sequencing data along with validated HLA typing results [28] are available. This dataset covers a wide ethnic diversity (1 Colombian from Medellín, 3 Utah residents with Northern and Western European ancestry in a trio, 1 Japanese from Tokyo, Japan, 3 Yoruban from Ibadan, Nigeria in a trio, 1 person of African ancestry from the southwestern United States, 1 person of Mexican ancestry from Los Angeles, and 1 Toscani from Italy) covering various different HLA types, making it an ideal dataset to test on. The bam files aligned to the GRCh38 version of the Human genome were downloaded from the 1000 Genomes data portal (http://www.internationalgenome.org/data-portal). For the Utah resident trio and the Yoruban trio, we downloaded fastq files and realigned to the GRCh38 version because GRCh38 bam files were not available. The complete list of links to the data downloaded is in Table S2 in Supplementary Materials.
Results
Simulation
In order to check that our method performs well, we tested our method on simulated data (see “Materials and Methods”). For each of the 6 HLA genes, 2 alleles from the set of full-length gene sequence in the IPD-IMGT/HLA database were randomly chosen. We repeated for a total of 100 replicates, resulting in 200 randomly selected alleles across all replicates. For each replicate, we simulated 50X coverage (25X for each haplotype) of paired-end WGS data. We compared Kourami, PHLAT, and HLA*PRG on the simulated data. Our method was evaluated using all 1200 alleles (2 alleles × 6 genes × 100 replicates), however, not all alleles could be used for the evaluation of PHLAT and HLA*PRG as both tools use their own digested format of the HLA database and built it into the tools so that the content of the database cannot be updated by a user. The database versions used by PHLAT and HLA*PRG are older compared to the version (v3.24.0) used for Kourami. Given a set of WGS data of an individual with an allele that is not in the database built into PHLAT and HLA*PRG both tools will fail to type the allele correctly as they are designed strictly to find the nearest match among the known alleles. For this reason, evaluation of PHLAT and HLA*PRG are only based on the subset of simulated alleles (1011 for PHLAT and 990 for HLA*PRG) that are in the database versions they use. For PHLAT, 4-digit ‘P’ resolution was used and 6-digit ‘G’ resolution was used for HLA*PRG and Kourami for evaluation. Table 2 shows the number of correctly inferred alleles as well as the accuracy for each HLA gene tested. For our method, we report both the typing and assembly accuracy where an assembled allele is correct if the output sequence is identical to the true allele sequence (no mismatch or indel). Even when an assembled allele is not identical to its expected true sequence, the typing of the allele may be correct if the closest sequence (minimum edit distance) in the database is the true allele. PHLAT achieves 93.6% across all HLA genes tested (89.8% for Class I and 96.6% for class II). HLA*PRG and our method perform equally well, achieving 99.8% typing accuracy across all genes (99.5% for class I and 100% for class II). Additionally, Kourami achieves 99.3% assembly accuracy.
Novel Allele Detection
The major benefit of our method is that it can assemble novel alleles across the typing exons, therefore its typing capacity is not limited by known alleles as is the case with other database-matching methods. Unlike the database-matching methods, Kourami uses the known alleles in the input database only to construct the HLA-graph that serves as a template for reference-based assembly but does not discriminate between the paths that encodes known alleles and novel alleles.
In order to demonstrate the ability to assemble novel alleles, we evaluated Kourami across various data where ground truth is known. We tested on the simulated data and the real data with previously validated HLA types (NA12878-NA12891-NA12892 Platinum trio and 11 samples from the 1000 Genomes Project) with a modified database of known alleles so that Kourami is not aware of true allele sequences. For each sample, we randomly selected one allele from each of the 6 HLA genes and removed the selected alleles from the reference MSAs (full-length and exon-only) provided by the release 3.24.0 of the IPD/IMGT-HLA database. When removing an allele, we removed all entries in the ‘G’ group that the allele belongs to and the entire list of alleles removed from each individual is shown in Tables S3, S4, and S5 in Supplementary Materials. We removed corresponding rows for the alleles from Mgene and Mcoding and generated new . The number of ‘G’ group alleles removed is and the bam files obtained were used as inputs to Kourami. Note that this experiment cannot be done with PHLAT and HLA*PRG as the database of known alleles are built into the tools.
Kourami correctly assembled 98.3%, 100%, and 98.3% of the removed alleles for simulation data, the Platinum trio, and 11 samples from the 1000 Genomes Project respectively (Table 3). Among 1000 Genomes samples, the only incorrectly assembled allele (supposed to be B*38:01:01) had 1 base-pair difference to the correct sequence. When the 59 correctly assembled allele sequences are aligned to all alleles in , many alleles were aligned equally well to a large number of known alleles. For example, C*05:01:01 alleles aligned to 122 other alleles with just 1 base-pair substitution. Among them, a significant portion contained base differences that result in protein-coding changes in typing exons. This shows that the databasematching methods such as PHLAT and HLA*PRG are not only unable to discover novel alleles but also faced with a problem of selecting the best out of many alleles with equally similar sequences.
Illumina Platinum Genomes
Platinum trio with validated results
Among the Illumina Platinum Genomes, we first ran Kourami, PHLAT, and HLA*PRG on the trio (NA12891, NA12878, and NA12892) with the previously validated 4-digit HLA types for 6 HLA genes (HLA-A, HLAB, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1) [29]. Kourami and HLA*PRG perfectly called the correct types where PHLAT missed a call in the HLA-C gene in NA12891. In a previously published article [12], PHLAT called all 12 alleles correctly and the difference may be due to the fact that in our evaluation all software were run on the set of reads that aligned to xMHC/HLA region of chromosome 6 and unmapped reads. Extraction of subset of reads by read mapping location and including unmapped reads are common practice to reduce computational time, and a similar technique was used in [9].
Trio consistency and inferred haplotypes
The pedigree of Illumina platinum genomes include many third generation offspring and only the top right-hand trio in Figure 6 has previously validated HLA typing results. Since this trio includes the mother (NA12878) of all third generation offspring, if HLA typing results are trio-consistent across all trios and all second-generation-haplotypes are present in one of the children, we can theoretically infer HLA haplotypes of the second-generation male (NA12877) as well as the half of HLA haplotypes in the first-generation individuals (NA12889 and NA12890).
We tested all 3 methods to determine whether predictions are trio-consistent across all trios (trio consistency shown in Table 4). Kourami and HLA*PRG agreed on all 204 alleles at 6-digit ‘G’ resolution and the predicted alleles were trio-consistent and inferred haplotypes across HLA genes (intra-gene phased) are shown in Figure 7. PHLAT’s predictions were trio-consistent only for HLA-C and HLA-DQB1 when evaluated at 4-digit ‘P’ resolution, and additionally for HLA-A when evaluated at 2-digit resolution. Although, we do not know the true HLA types for the rest of 14 individuals, it is very likely that the predicted HLA types are correct given that all typing results are consistent. Low trio-consistency ratios for PHLAT in Table 4 is mainly due to mistyped alleles in HLA-A and HLA-B in the NA12877 individual. Assuming the predicted HLA types for the pedigree are correct, no recombination seems to have occurred, leaving no disruption in ancestral haplotypes. In Figure 7, we labeled the haplotypes that are originating from the first generation members as paternal-grand-father1/2 (PGF1, PGF2), paternal-grand-mother1/2 (PGM1, PGM2), maternal-grand-father1/2 (MGF1, MGF2), and maternal-grand-mother1/2 (MGM1, MGM2). The haplotypes that are passed to second generation individuals are numbered ‘1’ to keep the numbering consistent in the 3rd-generation. Among 11 third-generation offspring, all 4 possible pairs of haplotypes were observed (2 PGF+MGF, 2 PGF+MGM, 4 PGM+MGF and 3 PGM+MGM).
1000 Genomes
We tested all three methods on this data set and the result is summarized in Table 5. PHLAT called 93 out of 122 alleles correctly, resulting in 76% accuracy when evaluated at 4-digit ‘P’ resolution, and 89% when evaluated at 2-digit resolution. HLA*PRG results were consistent with what was previously reported [12], resulting in 1 error (99.2% accuracy). Our method correctly called all of the alleles without any differences in bases. Note that the total number of alleles tested for DQA1 is 12 instead of 22 (2 alleles x 11 individuals) because the validation data for 1000 genomes [28] does not report DQA1 types. DQA1 type validation is only available for 6 individuals [29].
CPU and memory usage
Kourami is able to assemble and type HLA alleles given WGS data in a fraction of the time compared to the state-of-art methods such as PHLAT and HLA*PRG with a moderate use of memory. We compared the CPU and memory usage using the WGS of NA12878 from Platinum Genomes data (2 x 101bp 55x). All tests were run on the input of the reads aligning to xMHC region and unmapped reads. HLA*PRG was the slowest–taking 54.62 CPU hours, while PHLAT took 10.73 CPU hours and Kourami only took 0.09 CPU hours (611x speedup compared to HLA*PRG). HLA*PRG required the most amount of memory–consuming peak memory of 78.9 Gbytes, while PHLAT and Kourami used 3.6 Gbytes and 4.3 Gbytes respectively. HLA*PRG requires many more CPU hours and a larger amount of memory usuage because of the expensive dynamic-programming-based alignment to graph. Kourami relies on fast NGS aligners to align reads against known alleles first and project obtained alignment to HLA-graph to significantly reduce the computational time without sacrificing assembly or typing accuracy.
Discussion
We have shown that our HLA assembly method can accurately reconstruct both haplotypes that span the typing exons of HLA genes by using a graph representation of known alleles as a guide, and the produced haplotype sequences can be used to successfully carry out HLA typing given high coverage (> 30-fold) paired-end WGS. WGS carried out for other analysis can be used to type individual’s HLA types without the use of another experiment (SBT and other molecular assays).
Most notably, the ability to discover rare and novel alleles is achieved by taking an advantage of the flexibility of POG, combined with graph modification and it is instrumental in both research and clinical settings. It is important to note that previously available computational methods using non-targeted sequencing data cannot discover novel alleles because they are designed to find the best matching allele among the known alleles. Especially with continuously decreasing cost of obtaining a personal genome, personal WGS will only become more widely available, and our method can deliver accurate HLA typing without additional experiments and cost. Also, Kourami is able to assemble and type at 6 digit ‘G’ resolution at a fraction of the time compared to other methods with a moderate amount of memory usuage.
One limitation of our method is that it only supports WGS as it needs enough reads to cover both haplotypes for each typing locus, and does not work on other NGS assays, such as WES or RNA-Seq. Since WES is being used widely, as the cost for WES is lower compared to that of WGS, it is useful to be able to type HLA genes using WES. However our testing (not shown) shows that it is difficult to accurately assemble a full length sequence across the typing exons with WES because there are regions for which no reads are sequenced. This may be due to biases that WES has been reported to have [30] as well as decrease in effectiveness in detecting variants when using WES compared to WGS [30, 31]. Additionaly, high-coverage WGS is required to ensure accurate HLA assembly or typing. We randomly sampled coverages of 20x, 30x, and 40x from NA12878 data (Illumina Platinum Genomes) for 5 replicates and tested Kourami on these samples. Assembly and typing stays accurate down to 30x coverage (accuracy of 0.97 across the HLA genes) but at 20x coverage, the accuracy drops to 0.82 (Supplementary Table S6). This should not be a surprise as haplotype-resolved assemblies of human genomes used ≈100x coverage of NGS data [32, 33].
Highly accurate results from our method signifies the recent advancement in handling genetic variation using graph structures to encode variations found in multiple reference genomes [34, 35, 36, 13]. Specifically in Kourami, the minimal representation of POG allows the consistent graph modification via alignment projection and this in turn enables capturing of novel alleles as paths through the graph. At the same time, it reduces computational time greatly without scarificing accuracy, and this can be beneficial when used in high-demand clinical settings. Our approach can also be extended as a general method of using graph structures as guide to reference-based assembly of high diversity gene families.
Availability and Implementation
Kourami is open source and freely available at https://github.com/Kingsford-Group/Kourami. It is implemented in Java and supported on Linux, Mac OS X, and Windows.
Acknowledgements
We thank S. Kim of Illumina for helping us in the early stage of this research. We would also like to thank D. DeBlasio, C. Ma, G. Marçais, N. Sauerwald, M. Shao, B. Solomon, T. Wall, H. Wang, and H. Xin for valuable discussions and comments on the manuscript. This research was funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institute of Health (R01HG007104).