## Abstract

In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov Model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and longread data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.

The current paradigm for analysing human genomes using high throughput sequence (HTS) data is to map to a single haploid reference sequence in which there is no representation of variation^{1–3}. Across much of the genome, such exclusion has little effect on the accuracy of genome inference because of the relatively low genetic diversity of humans. However, for some regions, such as the major histocompatibility complex (MHC) on chromosome 6, which contains the human leukocyte antigen (HLA) genes, there is very substantial sequence and structural variation^{4}. Such diversity can result in poor genomic characterisation in individuals who carry sequence that is either missing or highly divergent from the single reference. Other locations of high diversity include the KIR^{5} region, olfactory gene clusters^{6}, ancient inversions such as that on 17q21.31^{7–9} and regions of recurrent genomic rearrangement^{10}, many of which have substantial influence on phenotype and disease risk. In many of these cases, multiple alternative haplotypes have been characterised and are available. For example, there are eight alternative MHC haplotypes in the human reference (GRCh37^{11}). More generally, sequencing projects have greatly advanced our understanding of human genetic variation^{12–14}; using such information to help characterise human genomes represents an important and unsolved problem.

The problem of the single reference approach and the potential of using known variation to characterise the MHC is demonstrated in Figure 1 for a single individual (CS1). When the standard reference (carrying the PGF MHC haplotype) is used for mapping, large fluctuations in coverage and substantial read mismatches are observed (Fig. 1a, b). However, when a more appropriate reference is used, (here identified by comparing the classical HLA genotypes of the sample with those of the eight reference haplotypes and noting that one of the eight haplotypes was a close match), read coverage and alignment is greatly improved (Fig. 1c, d).

In order to use prior information about variation there are five main challenges. First, there needs to be a data structure for representing genomic variation that can accommodate multiple sources of information, from high quality assembled reference sequence (such as the ALT paths in GRCh37) to the lower-quality, but very extensive, catalogues of variation such as the 1000 Genomes Project^{12,14}. Second, there have to be algorithms that match high-throughput sequencing (HTS) data from a subsequent sample to the reference structure in order to best characterise the haplotypes present. Third, and potentially simultaneously with step two, there have to be methods for detecting additional variation not yet represented in the reference data structure. Fourth, because most functional information (such as the location and structure of genes) uses the coordinates of a single primary reference, there has to be a coherent way of projecting information from a variation-aware reference onto a primary sequence. Finally, there need to be methods to validate and compare the output from a variation aware reference tool-chain to the genomic information provided by existing approaches that rely on a single reference genome.

To date, limited progress has been made on addressing these challenges. Graph structures of local sequence variation and/or prior knowledge of small-scale variants are used to aid read assembly by several variant-calling algorithms^{15–17}. However these make no attempt to build a re-usable reference structure and do not incorporate long or diverged haplotypes such as in the MHC. The pragmatic approach of appending alternative sequences to the end of a reference has been proposed^{18} but cannot solve the fundamental problems of distinguishing between sequence similarity arising through orthology and paralogy. Conversely, the approach of identifying a best reference from among a set has been applied to Arabidopsis^{19}, but cannot address the problem that novel genomes are likely to be closest to a mosaic (arising through recombination) of those already known. Methods have been developed to represent multiple aligned genomes in a manner that allows inexact matching^{20}, but these are impractical for human genomes because of memory requirements (greater than 1Tb of RAM). Similarly, progress on variation aware data models has been made^{21}, though with no implementation. None of these methods represent a general and practical solution to the problem of describing and using information from multiple reference data sets.

Here, we present a solution to the challenges described above. We describe an approach for representing known variation called a population reference graph (PRG) and a series of algorithms that enable characterisation of the genomes present in an individual from HTS data. We build on previous work for using coloured de Bruijn graphs for analysing sequence variation^{22}, but also take advantage of the existing tool chain for read mapping and variant calling^{3,15}. To demonstrate the value of the method we develop a PRG for the MHC region and combine simulation with analysis of empirical data on SNP genotypes, classical HLA types, short-read and long-read Moleculo data from high coverage samples.

## The population reference graph

A population reference graph (PRG) is a directed acyclic graphical model for genetic variation that is generated by combining information about known allelic relationships between sequences (Figs. 2a, 2b). The graph is constructed in three steps (see Supplementary Text for a detailed description). First, reference sequences are aligned using standard multiple sequence alignment (MSA) methods^{23,24}. Second, a graph structure is generated from the MSA by collapsing aligned regions with sequence identity over a defined kmer size. Third, additional SNP information, which is encoded in VCF and related formats through a reference position and alternative sequences, is added to all paths with matching sequence at a given position (for example, a SNP cannot be added to a path where there is a deletion at a given position). Here, all REF and ALT sequences from GRCh37 are used, along with the catalogue of SNPs from the Phase 1 release of the 1000 Genomes Project and the set of classical HLA allele sequences from the International Immunogenetics Information System (IMGT^{25}) at key HLA Class I and Class II loci (Table S1). The resulting graph structure can be thought of as a generative model for genome sequences. From a limited set of input sequences, many different paths through the graph are possible, thus mimicking the effect of recombination. In regions of high diversity, the size of the state space can become very large (Fig. S1).

## Using the PRG to infer individual genomes

The development of HTS technologies in humans has largely relied on the notion that the genome(s) of the sample(s) in the experiment will be closely related to those of the reference, thus enabling reads to be mapped accurately and with appropriate certainty. We extend this idea by attempting to infer the (diploid) path through the PRG that most closely resembles the two haplotypes of the sample. Specifically, by comparing the HTS data from a sample to the PRG we construct a diploid personalised reference genome, here referred to as a chromotype. To infer novel variation not yet present in the PRG, we map reads to the chromotype and use existing variant calling software^{15}.

Chromotypes are inferred by considering the HTS data from a diploid sample to be emitted by a pair of paths through the PRG and approximating the emission process so as to benefit from the computational efficiency of hidden Markov model techniques. Briefly, HTS data is summarised by the counts of each string of length k (kmer). Similarly, the set of kmers that can be emitted from the PRG is enumerated, eliminating those that occur multiple times across the genome and that are therefore uninformative for local sequence inference (Fig. 2c). Finally, by using a probabilistic model for the emission of kmers (see Methods), the Viterbi-algorithm infers the maximum-likelihood (ML) pair of paths (chromotype) through the PRG (Fig. 2d). Note that this approach does not preserve any phase information across regions where paths merge. The diploid path is thus best understood as a bifurcating/merging sub-graph of the PRG, where heterozygous sites induce bubbles.

To detect novel variation within the sample the inferred chromotype is decomposed into two sequences (with no attempt to establish phase between adjacent bubbles in the chromotype), which are used to replace the homologous region in the primary reference. Reads are mapped to the two resulting reference genomes and each read is placed at its best position across the two reference genomes, as measured by mapping quality. A standard variant caller^{15} is used to discover new alleles independently in the two mappings and a heuristic algorithm modifies the chromotype accordingly. We have also developed an approach for mapping reads directly to the chromotype, which is important for validation using long-read sequences (see below) though currently too slow for primary analysis at the scale of millions of reads.

## Validation and comparison to other methods

To assess the value of the PRG approach in characterising variation within samples we used simulations and empirical data analysis. We compare four approaches to characterising variation.

As a base-line we use a single reference (the PGF haplotype within the MHC region from GRCh37) and look at the effect of calling a sample as everywhere homozygous-reference.

We use a read-mapping approach (Stampy

^{3}followed by Platypus^{15}) in which the components were designed explicitly for high sensitivity detection and genotyping of short INDELs and clustered variants.From the PRG, we assess the Viterbi chromotype.

From the PRG, we assess the results from mapping reads back to the Viterbi chromotype with BWA

^{1}and calling using Platypus^{15}(i.e. the modified chromotype).

The output of each approach can be represented as a chromotype, thus enabling comparison between PRG and single-reference based methods.

### Simulations

We simulated high coverage HTS data (85bp error-free paired-end reads from a 30x genome) for 20 individuals. The primary effect of read errors is to reduce kmer coverage, hence their omission. Each simulated diploid genome consists of two random paths through the PRG for the extended MHC (xMHC). The simulated genomes carry a mixture of recombination events between the original eight MHC haplotypes, SNPs and structural variants of varying size (insertions and deletions from 1 – 125,000bp). We assess the accuracy of the PRG approach through genotype concordance of the inferred paths through the PRG with the simulated paths (Table S2). Across all positions, 99.86% of alleles are correctly recovered. Accuracy at heterozygous SNP positions is similar (99.83%) and drops slightly for INDEL positions (ranging from 95.76% to 99.97%, Figs. 3a, 3b).

### Experiment 1: Comparison to SNP array data

To assess the ability of the PRG approach to genotype variation at sites of high uniqueness within the genome, we compared metrics of accuracy at SNP positions independently interrogated through array genotyping and HTS (1 sample [NA12878] at 60x coverage with 100bp paired end-reads and 5 clinical samples [CS2-6] at 30x coverage with 90bp paired-end reads; see Methods).

The accuracy of all approaches is high (Fig. 3a), ≥97.38% concordance with the Illumina Omni 2.5M array (NA12878) and ≥99.53% allele concordance with the Illumina 1M array (CS2-6) for all approaches. Comparing the array genotype concordance of Platypus-generated genotypes and PRG-generated genotypes, we find that both approaches yield comparable accuracies (97.75% vs 97.45% for the 2.5M array and 99.57% vs 99.66% for the 1M array, Fig. 3c, Table S3).

Of the 285 sites at which the array genotypes for NA12878 disagree with the Viterbi chromotype, in 55 cases this difference is driven by the Viterbi chromotype specifying at least one gap character suggesting the presence of an indel that could interfere with array genotyping. We have manually inspected the reference alignment of NA12878 reads for these sites, and find clear evidence for the presence of at least one deletion in 33 of the 55 cases (we provide visualisations of read mapping at all positions in Supplementary File 1). These findings suggest that a significant fraction of the discrepancy between array and PRG approaches results from array errors at polymorphic indels.

### Experiment 2: Comparison to classical HLA data

In regions of high sequence diversity, such as the classical HLA alleles, single-reference mapping and variant calling methods may perform poorly because of the density of mismatches to the reference. To assess the accuracy of different methods at the classical HLA alleles, we compared the per-base diploid genotypes inferred by mapping and PRG approaches to those expected from the results of sequence-based typing (SBT) of the highly polymorphic exons of Class I (*HLA-A*, *-B* and *–C*) and Class II (*HLA-DQA1*, *-DQB1* and *–DRB1*) genes in NA12878 and CS2-6. We analysed agreement with the corresponding allele reference sequence for the reported allele (in HLA nomenclature this means XX:XX:01 or XX:XX:01:01 at 6 or 8 digit resolution respectively)

Where diversity is relatively low and sequence coverage is very high, the accuracy of mapping-based approaches is high; for example 100% concordance for NA12878 at exon 2 of *HLA-A* and *–C* (Fig. 3d, Table S4). However, when coverage is lower, or sequence divergence and / or complications from paralogy are greater (e.g. for *HLA-B* in the CS2-6 samples or *HLA-DRB1* in all samples), the PRG approach typically outperforms mapping (e.g. 97.19% concordance with the PRG genotypes versus 89.85% concordance with mapping-based genotypes at *HLA-DRB1* in the CS2-6 samples). Mapping reads to the chromotype has a marginal effect on accuracy (typically < 1% gain).

### Experiment 3: kmer recovery from high coverage samples

Central to the use of the PRG in assembling individual genomes is the notion that it contains the majority of sequence that is likely to be found in any sample. In the absence of full and independent de novo assemblies, we can assess the extent to which any given chromotype is an accurate representation of the sample by measuring the recovery of kmers from HTS data. We apply this benchmark to NA12878 and the CS2-6 samples.

Across the 4.75 Mb xMHC region, the PGF reference contains 4.52M unique kmers of which 4.8% are not recovered in the HTS data from NA12878 (Fig. 4a). The mapping-based approach predicts 4.94M kmers, of which 1.2% are not recovered while the two PRG approaches predict 4.98M and 4.97M kmers respectively and 0.63% and 0.57% are not recovered. Results are comparable though slightly lower for all methods in the CS2-6 samples (Table S4). Consequently, the PRG approach both predicts greater sequence diversity than the mapping approach and achieves a higher rate of sequence recovery. Although the majority of the xMHC is accessible to all methods, there is substantial spatial heterogeneity in the rate of kmer recovery by all methods (Fig. 4b). Particularly in the HLA class II region, the PRG approach outperforms mapping approaches (Fig. 5), consistent with knowledge of genomic complexity involving the *HLA-DRB* paralogues. We also note that in some regions, in particular distal to *HLA-DRB5*, all approaches perform poorly in terms of kmer recovery (Fig. 5), suggesting that current catalogues of sequence within the xMHC are substantially incomplete.

Within the classical HLA loci, all methods perform well for class I loci, recovering 98-100% of kmers compared to 80-95% from the PGF reference haplotype (Fig. S2). At Class II loci, however, the advantage of the PRG method is pronounced, with approximately 99% of all kmers recovered for *HLA-DQA1*, *-DQB1* and *-DRB1*, compared to 88-95% for Platypus (against a base-line of 37-85% for the PGF reference haplotype).

### Experiment 4: Comparison to long-read Moleculo data

To assess alternative strategies for genome assembly over longer physical distances, we analysed high coverage long-read Moleculo data (25x coverage) from NA12878 (see Methods). We first identified 29,429 reads (median read length 3,165 bp) likely to have arisen from the MHC region through the presence of diagnostic kmers (see Supplementary Text), then aligned reads to the chromotypes generated by each approach. Read-to-chromotype alignment was performed with a Needleman-Wunsch-like alignment algorithm that aligns to graphs instead of sequence, implemented using dynamic programming (see Supplementary Text). We measure the scaled edit distance between reads and the chromotype (measured as non-identical characters in read to chromotype global alignment, divided by read length in kmers) as an indicator of genome accuracy.

We find that the mapping-based approach achieves the highest number of read alignments with zero mismatches (11,338 vs 10,071 for the modified PRG method). However, both PRG approaches result in significantly fewer reads with many mismatches and/or gaps (Fig. 6a, Table S5). For example, the total number of alignment columns indicating a deletion in the chromotype decreases from 1,017,231 (mapping-based) to 586,852 (modified PRG chromotypes). Likewise, the number of reads with very bad alignments (more than 150,000 gaps in the aligned read or ≥33% of the aligned chromotype string consisting of novel gaps) decreases from 303 to 134. The modified PRG chromotype has a modest benefit over the Viterbi chromotype, increasing the number of perfectly mapped reads from 8,359 to 10,071. Across the *DRB5* region (identified from the kmer recovery analysis as being most poorly represented by the PRG) we find reads that suggest the presence of an inversion relative to known sequence (Fig. 6b).

## Discussion

We have presented several innovations that address the problem of how best to represent and use information about known genetic variation in the assembly of HTS data from novel samples. These are an approach for representing information through the population reference graph (PRG), a practical method for reconstructing personalised reference genomes by comparing HTS data to the PRG, a method for detecting variation not present in the PRG and a series of benchmarking tests that enable comparison of methods, whether PRG-based or otherwise, which can complement other benchmarking approaches^{26}. Our approach is both modular (such that progress can be made on each element independently) and maintains the ability to project information onto a canonical primary reference that is used as the basis of most functional annotation.

In constructing the PRG, our approach has been to combine multiple sources of reference variation information including GRC37, the 1000 Genomes Project^{14} and IMGT^{25}. Importantly, we make no claim that the information contained within the graph is a full or accurate description of the variation within the region. For example, the 1000 Genomes Project variant list is estimated to have a false discovery rate of 1-5% depending on the variant type. However, the presence of false variants within the PRG should have relatively little effect on accuracy, because paths through such variants will have no support in HTS data. The aim of the PRG approach is not to describe all human variation but simply improve genome assembly through representing the diversity of sequence that may be present in an individual. Importantly, we evaluate the approach using metrics that are independent of whether a PRG or simple reference approach is used (kmer recovery, genotype accuracy at known SNP and HLA variants and long-read alignment). Future versions of the PRG could, however, more closely match observed patterns of variation, for example by removing paths never observed and weighting paths by population frequency. However, whether such changes improve genome inference would have to be evaluated empirically. What the approach has demonstrated is that graph based methods offer benefits in regions of high sequence diversity and that there are substantial stretches in the Class II HLA region of the MHC that are poorly characterised by all existing reference material, indicating the need for ongoing collection of reference variation data.

The current implementation has several limitations. First, by summarising information as kmers, we lose longer range information, particularly from read pairs. Likewise, the process of separately mapping reads to each of the pair of chromosomes within the inferred reference chromotype is *ad hoc* and inefficient. Both features arise from our attempts to maximise the usability of the PRG approach, notably efficient inference using HMMs and the desire to use as much of the existing tool-chain as possible. Both problems could also be solved potentially by using richer data structures that retain longer range information in both reference variation data and short-read data, for example the FM-index structure of SGA^{27}, augmented de Bruijn graphs that retain short-range path information or mapping directly to the graph. Nevertheless, despite such limitations, we have demonstrated that in regions of the MHC with high structural and sequence diversity, the use of the current implementation can improve genome assembly.

A more fundamental limitation of local graph approaches is that they fail to use the much longer-range information arising from haplotype structure (linkage disequilibrium). Sharing of SNP haplotypes over megabase scales is common even among samples of unrelated individuals, hence such information, which is used in applications such as imputation^{28–31} and refinement of low-coverage sequencing data^{32}, has the potential to further improve genome inference. Such information could, however, be included as a prior structure on paths through the PRG, for example in a generalisation of the HMM approaches used by imputation methods^{28–31}.

## Methods

A full description of the PRG algorithms can be found in the Supplementary Text, including (i) the algorithms used to build PRGs from a set of reference data, (ii) the algorithmic and statistical methods for inferring a best diploid path (chromotype) through the PRG, (iii) the algorithm to discover novel variation not presented in the PRG, (iv) the graph-mapping algorithm used for the contig analysis. Data used are as follows. CS1 and CS2-6 samples: germ-line DNA was extracted from peripheral blood samples collected from consented clinical trial subjects, previously determined to have evidence of a Class II HLA risk marker for drug induced liver injury^{33}. DNA was fragmented and size selected to create 2 × 180 base pair (bp) libraries and 2 × 800 bp libraries. These libraries were sequenced on a HiSeq 2000 to generate 90bp paired end (PE) reads at the BGI (Shenzen, China). For the CS1 sample approximately 200 Gb, and for each of the 5 samples in CS2-6 approximately 100Gb, of sequence was generated. For Fig. 1, CS1 data were initially aligned to GRCh37 on the CLC Genomics Workbench (version 6.5.1) and coverage and intact and broken PE read numbers determined for ∼180 kb surrounding *HLA-DRB1*. This process was repeated utilizing GRCh37 with the addition of the MANN alternative MHC haplotype. For all remaining analyses on CS2-6, reads were mapped to GRCh37 using Stampy^{34} and BWA^{1} and variants were called using Platypus 0.1.8^{15}. Read data for NA12878 from the Illumina Platinum genomes project (HiSeq 2000, ∼60x coverage, 100bp paired-end reads; www.illumina.com/platinumgenomes/) was obtained from the EBI (www.ebi.ac.uk/ena/data/view/ERP001775). Reads were aligned to GRCh37 using BWA 0.6.2^{1} and variants were called with Platypus 0.1.8^{15}. Moleculo data is available from the 1000 Genomes Project ftp site; ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20131209_na12878_moleculo.

## Supplementary Methods

In this document, we give precise descriptions of the algorithms for population reference graphs (PRGs).

### High-level summary

We describe the following subsequent steps:

Construction of Population Reference Graphs (PRGs) to represent multiple genomes:

How to create a nucleotide PRG from a catalogue of variation, consisting of scaffold haplotypes for the whole region to be modeled and additional variant specifiers.

Genome inference:

How to create a kMer-PRG, a kMer-emitting object with identical haplotype structure to nucleotide PRG.

How to compress the kMer-PRG in a way that reduces its size and improves its statistical properties for the model we describe (the compressed kMer-PRG is called “multi-PRG”, because states of the model can now emit multiple kMers).

How to define and parameterize a Hidden Markov Model (HMM) “on top” of the multi-PRG.

How to use the HMM to infer a sample’s chromotypes, and how to represent them in VCF format.

Novel variant detection:

How to take pairs of inferred haplotypes (from the HMM on top of the multi-PRG) and use them as a basis for conventional mapping technologies, enabling the discovery of variants not present in the original catalogue of variation.

Graph alignment and validation:

How to align sequence to PRGs, using a modified version of the Needleman-Wunsch algorithm.

Experimental details:

The protocol followed to create a PRG for the extended human MHC.

Sample data, HLA types and availability.

## 1 Constructing a nucleotide PRG

We define a population reference graph as a directed acyclic graph with one designated start vertex and a set of designated final vertices. There is a *level function* that returns a positive integer for each vertex; all edges are defined between vertices of consecutive levels; and all final vertices are of the same level. All edges are labeled either with a nucleotide or a gap symbol, and that label is emitted upon traversal. Each node has a (potentially improper) probability distribution over the edges emanating from it.

The algorithm for building PRGs is based on a catalogue of scaffold haplotypes and additional variant specifiers (catalogue of variation, COV). Scaffold haplotypes span the region to be modeled, additional variant specifiers define allelic variation in the context of one or more of the scaffold haplotypes. SNPs, for example, are typically included as additional variant specifiers.

Informally, we will construct a PRG that contains all scaffold haplotypes as paths, allowing for recombination between a set of haplotypes whenever there is a stretch of identity between the members of the set. Additional variant specifiers are represented as “bubbles” on top of the scaffold haplotype graph.

PRG definitions and generative algorithm:

Let the directed connected graph G consist of the set of directed edges E and the set of vertices V, i.e. G = (V, E). For all e = (v

_{x}, v_{y}) ∈ E, v_{x}∈ V and v_{y}∈ V; we call e the directed edge from v_{x}to v_{y}. For PRGs, we require that there is a well-defined*level function*function l(v) for each node v, according to the following definition. There is one and only one vertex v_{0}with no incoming edges and l(v_{0}) = 0. This vertex is called the start vertex. For every (v_{x}, v_{y}) ∈ E, we define l(v_{y}) = l(v_{x}) + 1. All vertices with no outgoing edges are called “final vertices”, and we require that the level function returns the same value L for all final vertices.Each edge e ∈ E is labeled with a nucleotide (A, C, G, T), a gap symbol (“_”) or possibly a wildcard character (“*”), which stands for any nucleotide. At each node n, there exists a (possibly improper) edge probability distribution over the edges emanating from that node, specifying the probability P

_{n}(e) to follow edge e, conditional on being at node n.(For notational clarity, we do not explicitly consider the case here that there can be multiple edges between two nodes with different labels, or with the same label. However, all definitions made in this document are easily extended to cover these cases.)

To generate a haplotype from that model, carry out the following algorithm:

Define a “current vertex” variable c

_{v}and initialize it as c_{v}:= v_{0}.Select an edge (v

_{x}, v_{y}) from the set of edges emanating from c_{v}according to the edge probability distribution at c_{v}. Emit the label of the edge and set c_{v}= v_{y}.If c

_{v}is a final vertex, terminate; otherwise, go to step 2.

The model so-specified is similar to the well-known class of haplotype graph models [1]. In particular, definition of a suitable emission probability distribution on top of each edge label will result in a Hidden Markov Model (HMM); and the model can easily be generalized to emitting diploid data. For details, see [1, 2].

PRG construction algorithm, introductory definitions:

Start with a multiple sequence alignment (MSA) of all S

_{N}scaffold haplotypes (constructed using external software, see section “Sequence Alignment”). The MSA has S_{N}rows and L columns (L depends on the exact scoring configuration and algorithm used for creating the MSA; also, L is also the last level of the PRG to be constructed).Each level i > 0 in the PRG refers to the i-th column of the MSA.

S

_{n,i}denotes the i-th position of the row of the MSA for haplotype n. (n ∈ {1 .. S_{N}}, i ∈ {1 .. L}).There is a set X of additional unique variant specifiers of the form (n,i

_{1},i_{2},*seq*). n specifies which row of the MSA (which scaffold haplotype) the variant specifier refers to; i_{1}and i_{2}define the column range within the MSA that the variant specifier refers to; and*seq*defines an alternative sequence of characters (constrained to the set of nucleotides, “_” and “*”) to be available at these columns. For example, a SNP could be specified as (2, 5, 5, “A”): this means that position 5 in scaffold haplotype 2 might also be an “A”, instead of, say, a “T”. Although in principle fully generalizable, the current version of the algorithm is specified only for variant specifiers of length 1 (i.e. i_{1}= i_{2}); however, it is possible to specify independent variant specifiers at consecutive positions. Overlapping variant specifiers are not permitted.For each vertex v, we define a function H(v) which specifies a set of scaffold haplotypes “attached” to v. At each level of the PRG, each scaffold haplotype is attached to exactly one node. Informally, the set of attached scaffold haplotypes will determine which labels the edges reachable from n will carry.

We also define a function

*suffix*(v, r) for each vertex v at level l(v) ≤ (L – r) which returns the set of strings defined byFor all n ∈ H(v), the concatenated symbols from columns l(v) + 1 .. l(v) + r of row n of the MSA (i.e. from the scaffold haplotypes attached to v).

The strings from a), modified by the set of relevant variant specifiers. Each distinct variant identifier for a particular position (i.e. relating to the same scaffold haplotype; at the same position; with a distinct

*seq*) maps to at least one additional string, generated by substituting the specified position in the string from a) with the allele specified by the variant specifier.For performance reasons, however, we ignore recombination between variant specifier at different positions (this only applies to the function

*suffix*(..), and not to the PRG itself). For example, two nearby tri-allelic positions might induce*suffix(..)*to return three strings. We note that the effect of this simplification is that suffix distributions from different nodes might look more distinct than they really are, but never more similar.

PRG construction algorithm:

Construct a multiple sequence alignment from the scaffold haplotypes (using external software, see section “Sequence Alignment”).

Initialize the PRG with a single start vertex v

_{0}at level 0 and set H(v_{0}) to be the set of all scaffold haplotypes.Iterate from i = 1 .. L:

For each vertex v at level (i – 1):

For all scaffold haplotypes with index n from the set H(v):

Determine the character S

_{n,i}of the corresponding position / haplotype in the MSA and the set {(n,i,i,*seq*)} of relevant additional variant specifiers. If there exists an edge (v, v’) carrying the character S_{n,i}, follow that edge and add n to the set H(v’). Otherwise, create a new vertex v’ with l(v’) = i and the corresponding edges (v, v’), labeled with the S_{n,i}. Further ensure that for each element {(n,i,i,*seq*)} there is an edge (v, v’) labeled with*seq*.

For all possible pairs (v

_{x}, v_{y}) of (newly created) vertices at level i:We merge v

_{x}and v_{y}if their suffix distributions*suffix*(v_{x}, r) and*suffix*(v_{y}, r) are similar. We use r = 20 and require that all r compared positions are non-gaps (i.e. ≠ “_”). If some of the r compared positions are gaps, we dynamically increase r (due to performance only up to a threshold– we assume that two suffix distributions are not identical if we reach this point).“Similar” in the currently implementation means “identical”. If the set of scaffold haplotypes was large and representative of a population, using similarity measures similar to those used when constructing haplotype graphs would be an alternative ([1, 2]).

Suffixes that contain wildcard characters are treated in a special way. We define that vertices with suffix distributions that contain only wildcard characters (“*”) are compatible with all other nodes; and that single suffix strings consisting exclusively of wildcard characters are not considered when determining suffix compatibility.

To merge two vertices v

_{x}and v_{y}, define H(v_{x}) = H(v_{x}) ∪ H(v_{y}); redirect all edges going into v_{y}to v_{x}(and remove duplicate edges with identical labels, if necessary); and delete v_{y}.If any vertices were merged during step b), repeat step b).

We have now constructed a PRG for the complete MSA. For post-processing,

Mark all nodes and edges that are necessary to trace the path of the scaffold haplotype which is identical to the canonical reference genome in the region through the graph.

Identify and remove arcs (sequences of connected nodes and edges, equivalent to a subpath through the graph; see Section “kMerification” for a formal definition of “subpath”) that consist exclusively of edges labeled with wildcard symbols (“*”) and that can be removed without affecting the reachability of any subpath that does not exclusively consist of wildcard symbols. (We remove as many wildcard symbols as possible because we ignore them in downstream analyses).

### 1.1 Varying number of scaffold haplotypes

The algorithm specified assumes a fixed number of scaffold haplotypes across the region to be modeled. This is, however, not always the case: in the extended MHC, for example, we have eight scaffold haplotypes across the region (see Section “Graph for the extended MHC”), but a much higher number of haplotypes for the six classical HLA loci.

We deal with such situations by dividing the whole region to be modeled in stretches with an identical number of scaffold haplotypes; we then construct separate PRGs for each stretch; and finally we connect the separate PRGs (by fusing last-level nodes of one PRG with the first-level nodes of the next PRG) to obtain a combined region-wide PRG.

## 2 Genome inference

Before giving a formal description of the process of genome inference, we give an informal introduction.

The object we have built – a nucleotide PRG – models population sequence variation at the level of individual nucleotides. While having many advantages (immediately intuitive representation of sequence; easy to visualize; clear preservation of gap homology from the MSA), there are also disadvantages to this approach. Making genome inference (i.e. computing the two most likely paths through the PRG for some given sample data, and detection of additional variation) from the nucleotide PRG itself is not straightforward, for we don't directly observe its most fundamental unit, the nucleotide in its full PRG-level context, from sequencing data.

There are two ways around that:

Computation of all possible micro-haplotypes (kMers) of specified length through the nucleotide PRG and counting how often each micro-haplotype occurs in sample data.

We refer to this process as “kMerification” and use it as our main approach for genome inference.

Read alignment – in analogy to read-to-reference alignment, it is possible to align sequencing reads to a PRG, resulting in a labeling of each individual nucleotide with the level of the PRG it is assumed to be homologous to.

Read alignment has many advantages over kMerification – it is more tolerant of sequencing errors and utilizes the full length of sequencing reads to establish homology. However, at least in our current implementations, it is much slower than kMerification-based inference. Hence we use read-to-graph alignment to validate (see Section 4) the genomes we have inferred, but not as a primary means of genome inference.

As stated above, we rely on kMerification as the base for genome inference.

That is, we “kMerify” the nucleotide PRG to obtain a kMer-PRG, a graph with equivalent haplotype structure but edges labeled with kMers instead of individual nucleotides.

To improve some statistical properties of that graph for downstream inference, we further “edge-compress” the kMer-PRG to obtain an object we refer to as “multi-PRG”. The main difference is that we have combined all non-branching stretches of levels in the kMer-PRG into single levels for the multi-PRG, with edges labeled with multiple kMers.

Finally, by assuming that observed kMer counts in the sample data follow a Poisson distribution parameterized by how often a kMer appears in the genome (i.e., after some pruning: how often a kMer appears in a state of the multi-PRG), we use the multi-PRG to derive an efficient Hidden Markov Model (HMM) to infer an individual’s two assumed haplotypes.

The following figure illustrates the different steps of transformation from nucleotide PRG to multi-PRG.

### 2.1 kMerification

The PRG has a probability distribution over the space of possible emitted haplotypes. Each possible haplotype can be transformed into a sequence of kMers. A PRG thus also induces a probability distribution over possible sets of emitted kMers. We now describe the algorithm to transform the PRG into a kMer-PRG that has the same distribution over the set of possible haplotypes and hence over the possible sets of emitted kMers.

Informally, we are searching for an object with equivalent haplotype output distribution, but which explicitly specifies kMers. We call that process “kMerification”.

In order to kMerify a nucleotide PRG, we need to define the set of nodes and edges that the kMer-PRG is to consist of. We define the edges by finding all “micro-haplotypes” of length k (i.e. subwalks of k non-gap characters) specified by the nucleotide PRG. In the process of kMerification, each such micro-haplotype will become one edge in the kMer-PRG.

Consider, for example, the two subwalks of length 3 starting at level 9 marked with blue lines in the figure above. In the kMer-PRG, both walks will be represented by separate edges. Note that there will be two edges, even though they will be labeled with the same kMer – this is because the underlying walks through the nucleotide PRG are different.

We now have an intuitive understanding of the edges that will be present in the kMer-PRG, but what about the nodes? Initially assume that (at a given level) each edge in the kMer-PRG ends in its own node. In order to ensure that the haplotype structure of the kMer-PRG reflects that of the nucleotide PRG, the question now becomes: which of these nodes have to be merged? Or, expressed differently: which kMer-PRG edges (reflecting walks of length k) need to end in the same kMer-PRG node?

In order to analyze that question for our two example paths, consider subwalks of length k + 1 from level 9. In the figure above, these walks can be constructed by elongating the two blue walks of length k by adding the orange edge. We note that the two subwalks of length k + 1 are, apart from the first component, identical: by definition the next level of the kMer-PRG will contain exactly one edge to represent the corresponding walk of length k. From the structure of the nucleotide PRG, it is also clear that this one kMer-PRG edge (starting at level 10 in the nucleotide PRG) needs to be connected to the two original kMer-PRG edges (starting at level 9 in the nucleotide PRG). Hence the two original kMer-PRG edges starting at level 9 in the nucleotide PRG need to end in one node, and the kMer-PRG edge starting at level 10 in the nucleotide PRG will start at that node.

Expressed more formally, at each level each edge in the kMer PRG, representing one walk through the nucleotide PRG, ends in a separate node, unless there is second edge in the kMer PRG at the same level which represents a walk that is, apart from the first component, identical to the walk of the first edge.

kMerification algorithm, introductory definitions:

A kMer is defined as a word (string) of length k, consisting of the characters A, C, G, T and the symbol for ambiguity (“*”), or as a string of length 0. We refer to 0-length kMers as “gap” kMers.

To transform a sequence of characters (consisting of the four nucleotide characters, gap (“_”) and ambiguity (“*”) symbols) of length L into a sequence of kMers, carry out the following algorithm for i = 1 .. (L-k+1): Starting from i and extending to the right, extract a substring of length >= k which contains exactly k non-gap characters. If the first symbol of the so-extracted substring is the “gap” symbol, we define the kMer emanating from position i to be the “gap” kMer. Otherwise, remove all “gap” symbols from the substring and we define the kMer emanating from position i to be equivalent to the substring.

(To give an example, transformation of the sequence AC_TAG into kMers of length 3 yields: “ACT”, “CTA”, “_”, “TAG”. “_” represents the gap kMer.)

Whenever a kMer contains one or more wildcard characters, we substitute all non-wildcard symbols in the kMer with the wildcard character. (One wildcard character makes the whole kMer ambiguous; we make that definition because it simplifies downstream statistical analyses.)

The definition of kMer-PRGs is identical to that of PRGs, with the exception that edges are labeled with kMers of length k (“kMer-edges”). k remains constant across a kMer-PRG.

A subpath in the PRG is defined as a sequence of edges e

_{1}, e_{2}… of the structure e_{1}= (v_{x}, v_{y}), e_{2}= (v_{y}, v_{z}) … (we could thus also call a subpath a “walk” through the graph). The set of possible subpaths from the start vertex to one of the final vertices defines the set of possible emitted haplotypes. We could measure the length of a subpath by either the total number of included edges or by the length of included edges which carry non-gap characters. Here we use the second measure (kMerification has to “jump over” gap characters), i.e. we define the length of a subpath to be the number of included edges that are not labeled with gap characters.A subpath defines a sequence of emitted symbols and can thus be transformed into a sequence of kMers (if it is long enough). Specifically, a subpath of length k induces a kMer of length k).

It is clear that each subpath of length k in the PRG has to exist as one kMer-edge in the kMer-PRG. While building the kMer-PRG, we keep track of each kMer-edge’s underlying subpath and of the nodes and edges in the PRG that this subpath traverses.

We define a function K(v) for all vertices v of the PRG that specifies a set of kMer-edges of the kMer-PRG that are attached to v. A kMer-edge can be attached to a node in the normal PRG if v is its underlying subpath’s second node (but not all kMer-edges have to be attached to a vertex). In implementing the algorithm, K(v) is the empty set for all vertices in the beginning and then extended iteratively while creating the kMer-PRG.

We now describe the core algorithm for kMerification. kMerification works incrementally, i.e. we will move through the nucleotide PRG from left to right, creating nodes and edges of the kMer-PRG as we walk along.

Informally, to define the initial set of kMer edges, we carry out a forward search for subpaths of length k starting at v_{0} of the PRG.

Then, conditional on an existing set of edges of the kMer-PRG (each representing a walk ending in a node in the nucleotide PRG), we create the next level of kMer-PRG edges by extending the walks of the edges of the present level by one character each (or by all possible one-character extensions, if there are multiple possible extensions).

To define the set of nodes required at each level of the kMer-PRG, we remove the first element from the walks (through the nucleotide PRG) corresponding to the kMer-PRGs edges at that level, and we create one kMer-PRG node for each unique reduced walk.

We stop once we have covered the complete PRG.

kMerification algorithm:

We initialize the kMer-PRG by creating a start vertex v

_{0;kMer}(we use the additional subscript to distinguish between v_{0}of the PRG and the start vertex of the kMer-PRG). We carry out a forward search (using standard algorithms) for subpaths of length k starting at v_{0}of the PRG. We create a kMer-edge for each found subpath, transform the underlying subpath sequence into a kMer and use this kMer as a label for the edge. We attach each created kMer-edge of the form (v_{0;kMer}, v_{n}) to v_{0;kMer}and leave the endpoint v_{n}undefined for the moment.We note that it is possible to have multiple kMer-edges carrying the same kMer as label (if the underlying subpaths are different).

The edge probability distribution for the start node of the kMer-PRG is induced by the probabilities of the corresponding subpaths in the PRG.

After creating edges for all subpaths emanating from v

_{0}, we compute the set of required vertices for the next level of the kMer-PRG, and attach the created edges to those vertices. We give the precise algorithm under point “C)”.This completes the kMerification of level 0 of the PRG.

To kMerify levels i = 1 … (L – k) of the PRG.

For each vertex v of l(v) = i in the PRG:

For each kMer-edge e = (v

_{x}, v_{y}) in the set K(v):Informally, the set of kMer-edges emanating from node v

_{y}in the kMer-PRG is determined by how we can extend the subpath underlying e.Formally, the subpath underlying e has a defined last vertex v

_{z}(v_{z}is a node of the PRG). We carry out a forward search for subpaths of length 1 starting at v_{z}. Each subpath defines one possible extension for e and thus induces the creation of a kMer-edge e_{x}(as in step “A)”, the endpoint vertices for these edges remain undefined for the moment).The subpath underlying e

_{x}consists of the subpath underlying e with its first symbol removed and extended by the subpath (of length 1) that induced the creation of e_{x}. The label for e_{x}is generated by computing the kMer induced by the subpath underlying e_{x}. (In particular, this means that the edge will be labeled with the “gap kMer” if the first symbol of the underlying subpath is a gap).The edge probability distribution at v

_{y}is induced by the conditional probability to follow the found subpaths, conditional on being at node v_{z}.

For the set of all kMer-edges created during the previous step “a”:

Compute the set of required vertices for the next level I + 1 in the kMer-PRG, conditional on the set of kMer-edges created during the previous step, and attach the edges to their corresponding endpoint vertices (so far, these have remained undefined). We give the precise algorithm under point “3)”.

We describe how to compute the set of required vertices for a level i + 1 of the kMer-PRG, based on the set E

_{i}of kMer-edges emanating from nodes of level i of the kMer-PRG.The result of this step (as explained above) will be that all edges that have “nearly” identical corresponding subpaths in the nucleotide PRG will end in the same node, with “nearly” here meaning that they are identical in all steps but the first.

Each e = (v

_{m}, v_{n}) ∈ E_{i}has an associated subpath in the PRG of the form (e_{1},e_{2},e_{3}..). We define the m1-subpath of a subpath (e_{1},e_{2},e_{3}..) as (e_{2}, e_{3}..), i.e. equal to the original subpath without the first edge.Clearly, there is a set of m1-subpaths induced by the set of subpaths associated with the kMer-edges in E

_{i}. We will create one vertex v at level l + 1 of the kMer-PRG for each unique m1-subpath (e_{2}, e_{3}..), and attach each kMer-edge to the vertex corresponding to the m1-subpath of its subpath.We note that this means that the original subpaths of the edges that end up attached to one vertex are identical from the second position onwards. We will thus attach one of the edges attached to the same vertex to K(v

_{x}), with the m1-subpath of the edge being of the form (e_{2}, e_{3}..) and e_{2}= (v_{x}, v_{y}).As a post-processing step, mark all kMer-edges for which the underlying subpath includes edges that were marked as parts of the canonical reference sequence (see post-processing for the PRG construction algorithm).

We also remove all subpaths that both consist exclusively of edges labeled with the wildcard character (“*”) and that can be removed without affecting the reachability of any subpath that is not exclusively labeled with wildcard symbols.

### 2.2 Edge compression

Edges in the kMer-PRG are labeled with single kMers. There are regions in the graph where each vertex has exactly one outgoing and one incoming edge, i.e. regions which consist of sets of non-branching subpaths. In a final transformation, we “compress” all such regions of the kMer-PRG into single edges, which are then labeled with (unordered) sets of kMer counts. We call the resulting graph the multi-PRG (“multi” because edges can be labeled with multiple kMers).

Formally, the definition of a multi-PRG is equivalent to the definition of kMers-PRGs, with the exception that each edge is now labeled with sets of the form {(kMer_{1}, count_{1}), (kMer_{2}, count_{2}), …}, where kMer_{1}, kMer_{2} .. are unique kMers and count_{1}, count_{2} .. specify how often the associated kMer is emitted upon edge traversal (we call these edges “multi-edges”).

Constructing a multi-PRG from a kMer-PRG is trivial. We give a short description of the algorithm.

Transform the kMer-PRG into a structurally equivalent multi-PRG, simply by substituting all edge labels

*kMer*with {(*kMer*, 1)}.Determine which regions from level i

_{x}to level i_{y}in the multi-PRG satisfy the following criteria:All nodes at level i

_{x}have exactly one outgoing edgeAll nodes at level i

_{y}have exactly one incoming edgeAll nodes with level > i

_{x}and < i_{y}have exactly one outgoing edge and one incoming edge.

This implies that there is one and only one subpath connecting each node from level i

_{x}with a node from level i_{y}(and vice versa).For all such regions from level i

_{x}to level i_{y}:For all vertices v

_{x}from level i_{x}:Determine the subpath connecting v

_{x}to a node v_{y}from level i_{y}Create a new multi-edge by forming the union (with counts) of the labels of all edges traversed by the subpath. The new multi-edge connects the first and the last node traversed by the subpath.

Delete all edges traversed by the subpath and all vertices traversed by the subpath except for the first and the last vertex.

The level function becomes temporarily ill-defined during the previous steps. There are different ways to deal with this in implementations. We use a temporary relaxation of the definition of the level function (specifically, dropping the requirement that vertices connected by one edge need to be assigned to subsequent levels) and a post-processing step to create a definition-consistent level structure (informally, by re-counting levels from the beginning of the graph).

As a post-processing step, mark all multi-edges created from edges that were marked for being part of the canonical reference sequence.

### 2.3 HMM

Multi-PRGs can be transformed into Hidden Markov Models (HMMs). One key challenge is that kMers which appear as edge labels on multiple levels of the graph violate the assumption that emissions from different states of the HMM to be created are independent. We deal with this by removing those kMers.

We and others have previously described how haplotype graphs can be transformed into haploid and diploid HMMs (see [1] and [2]), and we will only give a short informal description of this process here: each edge of the haplotype graph becomes a state in the HMM, and the emission probability distribution for each state is based on the underlying edge’s label. The resulting HMM will have a level structure analogous to that of the graph. State transition probabilities for the HMM are induced by the probability to move from one edge (v_{x}, v_{y}) to another edge (v_{y}, v_{z}) in the graph, i.e. by the edge probability distribution at v_{y}.

We note that a modified version of that transformation can be applied to multi-PRGs. The structural transformations (i.e. from edges in the multi-PRG to states in the HMM and the state transition probabilities) are identical to the original transformation, and we refer the reader to [1] for details.

We now define the emission probability structure of the HMM. Specifically, we define the emission probabilities of an HMM state s with underlying multi-edge e = (v_{x}, v_{y}). We assume that there is a coverage parameter α and an error rate parameter β. α specifies the expected haploid coverage on kMers that are present in a sample genome being sequenced, and β specifies the expected haploid coverage on kMers present in the multi-PRG which are not present in the sample genome being sequenced (We define how to estimate these parameters below).

Identify and remove from all edge labels kMers which

occur in the label sets of edges emanating from more than one level of the multi-PRG

occur in a region of a canonical reference genome not covered by the PRG

fail plausibility-based checks (see below)

Compute the set O

_{i}of all remaining kMers used in the edge label sets of edges emanating from any node at level l(v_{x}) = i, excluding“gap kMer” kMers (as “gap kMers” represent the empty string, they cannot generate an emission)

kMers with ambiguous symbols (this is a simplification we make on practical grounds and one of the reasons for why we try to eliminate those as far as possible).

Each kMer

*kMer*∈ O_{i}has an observed sample count o(*kMer*). o(*kMer*) is calculated by counting the number of occurrences of the word*kMer*in the sequencing reads from a sample.We define the observed sample counts o(

*kMer*) ∀*kMer*∈ O_{i}as the data we will model for all states at the level of state s.The state s that we want to define an emission probability distribution for has a label, and that label is a set of the form {(kMer

_{1}, count_{1}), (kMer_{2}, count_{2}), ..}.We note that each kMer in O

_{i}either appears in the label set for e, or not.If it appears in the label set for O

_{i}as (*kMer*,*count*), we model the observed count o(*kMer*) with a Poisson distribution, with expected value α x*count*.If it does not appear in the label set for O

_{i}(i.e. it is an error if we assume that s is the underlying state and the multi-PRG is a faithful representation of genomic variation), we model o(*kMer*) with a Poisson distribution with expected value β.

#### 2.3.1 kMer count plausibility checks

In the ideal case, the PRG is a comprehensive representation of genomic variation. However, in practice it is likely that there is genetic variation which is not captured by the PRG, which may distort our statistical approaches.

One important class of potentially confounding uncaptured variation is sequence duplication. In making inference, we generally exclude all kMers which occur in other parts of the reference genome or in other levels of the multi-PRG. Sequence duplication can have the effect that a kMer which we believe to be exclusively originating from one level of the multi-PRG is also present in another genomic location. That is, observed coverage on such a kMer not only comes from the multi-PRG, but also from other uncharacterized sources. One important symptom of this happening is excessive observed coverage on a kMer. Hence, for each kMer, we

assume that observed coverage in read data is modeled by a Poisson distribution, with mean proportional to general kMer coverage and the kMer’s underlying genomic count

assume a uniform prior on the kMer’s underlying genome count (ranging from 0 to at least 2 x the maximum number of occurrences of the kMer in any state of the multi-PRG)

combine prior and Poisson model to obtain a posterior probability of underlying genome count for each kMer

sum over all values of that posterior between 0 and ([maximum count of that kMer in any state of the multi-PRG] +2)

exclude the kMer if that sum is < 0.5.

#### 2.3.2 Simplification

It is not uncommon to observe levels with thousands of states in the haploid HMM, and the number of states at a particular level increases by power 2 in the diploid version of the HMM (see [1] for the algorithm for creating a diploid version of the HMM).

We employ a simplification algorithm prior to applying the model to sample data: before creating the diploid HMM, we examine all edges of the multi-PRG. For each edge, we determine the fraction of kMers specified (weighted by counts) with no coverage in the sample (i.e. o(*kMer*) = 0). If this proportion is ≥ 0.5, we delete the edge – unless it was marked for being part of the canonical reference genome. (The rationale for this is to always have at least one remaining path even if coverage is 0. The canonical reference genome is thus the fallback path for the model.)

#### 2.3.3 Estimation of α and β

We estimate α, the coverage parameter, and β, the error rate parameter, from the multi-PRG.

For estimating α, we identify levels of the multi-PRG at which there is only one edge defined per level. If the multi-PRG faithfully reflects the underlying genome, all of these edges have to have diploid coverage. We sum over the observed sample coverage of the kMers specified by these edges and divide them by the sum of the edge-label specified kMer counts and a factor of 2.

For estimating β, we employ a simple greedy heuristic algorithm. We select (typically ∼ 200) levels of the multi-PRG according to these criteria:

There are exactly two defined edges

Both edges have the same number of non-gap non-ambiguous kMers (the label set element (

*kMer*,*count*) contributes*count*to the kMer count of an edge).

We take α as estimated previously and start with an initial value for β (typically a small value below 1%).

At each selected level individually, we fit a simple Poisson model allowing for the three diploid “genotypes” defined by the two haploid edges (we use the same model as for the HMM emission probabilities, described earlier; this is conditional on current values of α and β). We select the maximum likelihood (ML) “genotype” call and treat coverage on edges not covered by the ML call as error.

After application to all selected levels, we sum over the total coverage on kMers not present on the edges of the ML calls, and divide it by the total number of kMers not present on these edges (i.e. a rate specifying how much coverage the kMers on edges which are likely not in the underlying genome receive). This gives us a new estimate for β. We repeat the estimation procedure, now conditional on the updated value of β, until the improvement in “total likelihood” of the data (here defined as the product of the likelihoods of the per-level ML calls) from one iteration to the next falls below a threshold.

### 2.4 Chromotypes

#### 2.4.1 Path inference

Standard statistical algorithms for HMMs can be applied to the defined HMM. Specifically, the Viterbi algorithm enables inference of one (diploid) Maximum Likelihood path through the model, and the Forward algorithm can be used to sample from the posterior distribution of paths [3].

The inferred diploid path is initially based on the multi-PRG, but can be mapped back unambiguously to equivalent traversals of the kMer- and nucleotide PRGs (each state of the [diploid] HMM is equivalent to two edges [one for each haploid traversal of the PRG] in the multi-PRG, and the transformation process from nucleotide PRG to multi-PRG can be reversed).

By concatenating the symbols emitted by the edges traversed by the ML diploid path through the nucleotide PRG, we create two “personalized reference haplotypes”.

However, the inferred diploid ML path (and hence the personalized reference haplotypes) loses phase at positions at which the underlying haploid HMM states are identical (that is, at positions at which both haplotypes traverse the same multi-PRG edge). “Loses phase” means that another diploid path, generated from the original path by flipping the two contained haploid paths after a position at which phase is lost, is statistically identical to the original path. For each Viterbi diploid path, we have thus a set of equally good diploid traversals which are identical to the Viterbi path in the edges that they traverse at each level, but not in haplotypic phase between different levels.

#### 2.4.2 Chromotypes

Chromotypes are a data structure to represent chromosomal genotypes at different levels of haplotypic resolution. The formal definition of a chromotype is that it is a PRG with not more than two edges at each level (as shown in the pictures below).

A diploid genome can be represented as two independent, completely resolved paths (“Chromotype B”), or as a sequence of homozygous stretches and heterozygous bubbles (“Chromotype A”) with phase lost whenever the chromotype enters a bubble – or as a mixture of the two approaches.

Chromotypes are a data structure well-suited to represent the set of Viterbi-equivalent diploid PRG traversals:

Based on the state sequence of the diploid HMM, compute the equivalent two paths through the multi-PRG, and then the equivalent two paths through the kMer-PRG.

Each path (traversal) if a sequence of nodes and edges through the kMer-PRG.

Create a (kMer) chromotype by combining the nodes and edges (and their connectivity) from the two traversals into one new graph. Nodes and edges appearing in both paths will appear only once in the chromotype. If an edge or node appears in both haploid paths, it is to appear only once.

If desired, this (kMer) chromotype can be converted to a nucleotide chromotype, for every kMer edge is equivalent to a subpath through the nucleotide PRG.

Chromotypes so-generated are subgraphs of the original PRG, describing an individual’s chromosomal genotype.

Chromotypes can also be created from sets of aligned strings. In the pipeline described here, this happens after read re-mapping to the Viterbi chromotypes – we use the re-mapped reads to identify new variants not yet represented in the PRG, but the discovery pipeline cannot deal with diploid reference genomes. Hence, we map to the personalized reference haplotypes, and modify the two strings according to the variants discovered. As a final step, we recombine the two modified haplotype strings back into a chromotype.

To create a (nucleotide) chromotype from two aligned strings, begin with a chromotype that has one start vertex and two completely resolved branches (encoding the two aligned strings) without branches. At each level, fuse nodes and edges of the two branches if the next k edge labels are identical for both branches.

#### 2.4.3 VCF creation

We note that the two personalized haplotypes imply a genotype at each position of the original MSA. It is sometimes desirable to express this information in VCF format. We give a sketch of the heuristic that we use to express one personalized reference haplotype in VCF format (the diploid case with two haplotypes follows then immediately).

There is a canonical reference which the VCF will refer to, also present as a row in the MSA used to create the PRG (in our case, this would be the chromosome 6 xMHC reference sequence “PGF”). At each column in the MSA, the canonical reference will either carry a specified non-gap character or a gap symbol.

For positions in the MSA at which the canonical reference is non-gap, creating the VCF for an inferred genotype is trivial. This includes cases in which the genotype to be expressed is a gap, but the canonical reference is non-gap – this means that the genotype to be expressed carries a deletion with respect to the reference, and all that is necessary is to find the starting position of the deletion with respect to the canonical reference. This is usually the first non-gap position to the left of the deletion in the haplotype to be expressed in a VCF, at which the canonical reference is also non-gap.

Positions where the haplotype to be expressed is non-gap, but the reference is gap, represent insertions in the haplotype with respect to the reference. In VCF format, insertions are usually expressed as longer alternative alleles at the position before the first inserted nucleotide. We can thus find the correct position for specifying an insertion by finding the first position to the left of the inserted sequence at which the reference row of the MSA is non-gap.

When creating VCFs, our algorithm produces additional files and annotations, which specify the positions in the MSA a variant in the VCF refers to.

## 3 Novel variant detection

### 3.1 Mapping reads to the personalized reference

After inferring two “personalized reference” haplotypes, using sequence read data from an individual, it is often desirable to use classical read-mapping tools to map the reads to the two personalized haplotypes. This enables the discovery of additional variants not yet present in the PRG, and can also be used as an additional step of quality control (regions of the personalized haplotypes with no coverage are generally less likely to faithfully represent the underlying genome than regions with good coverage).

We apply the following pipeline steps:

Create two “personalized reference genomes” by excising the region covered by the PRG and inserting the inferred personalized reference haplotypes.

Map read data from an individual independently to both personalized reference genomes. Mate-pair information could be used in this step. Our current implementation, however, ignores mate-pair information, for it cannot be assumed that the two inferred reference haplotypes correctly reflect long-range phase.

For each read mapped to the region covered by the graph, make a decision as to what personalized reference haplotypes it is likely to correspond to. We employ the following heuristic:

If a read is mapped to the region covered by the graph in only one personalized reference genome, assume that the read comes from the corresponding haplotype. As the rest of the reference genome remains unchanged, this implies that the mapping quality of the read over the region covered by the graph is higher than that of the read over alternative locations.

If the read maps to the region in both personalized reference genomes, use the mapping quality produced by the alignment software (and possibly other criteria like edit distance between read and the employed reference genome) to decide which of the two haplotypes the read should map to.

In cases of equal fit, choose uniformly.

Create two BAM files, covering the two personalized reference sequences and containing the reads mapped to each haplotype (from step 3).

Apply a variant-calling algorithm to the two BAM files, resulting in two VCF files.

Modify the two personalized reference haplotypes according to the produced VCF files (see below).

Finally, produce a VCF file representing the two modified personalized reference haplotypes (see Section “VCF creation”) and the corresponding chromotypes (see Section “Chromotypes”).

### 3.2 VCF-based reference haplotype modification

Step 6 of the algorithm specified above results in two VCF files, and we want to use the information from these VCF files to modify our personalized reference haplotypes.

We have developed a heuristic that solves this task to satisfactory degrees of accuracy (see the validation results presented in the main paper). However, we recognize the limitations of our approach. On a fundamental level, existing read mapping and variant calling tools cannot deal with diploid references, and this leads to complications in downstream analyses of the results.

The VCFs we want to use to modify the personalized haplotypes may specify insertions, deletions and single-nucleotide differences. We note that the two personalized reference haplotypes live in the coordinate space defined by the MSA. For each position in the MSA, we will now compute a set of implied novel genotypes (that is, implied by the VCFs based on the personalized reference haplotypes). We note that we need to allow for multi-character alleles in the columns of the MSA – to deal with insertions implied by the VCFs, for which there might be “no space” in the existing MSA coordinate system. The algorithm described under “VCF creation” can deal with multi-character alleles with small modifications.

We now describe how to integrate variants of different types into the two haplotypes. For each variant, the VCF it comes from will specify a position that it begins at; this position is relative to the personalized reference used a base for mapping, and can thus be translated into MSA coordinates. Note that each VCF will specify two alleles for each position; we will thus end up with up to four specified alleles at each position.

Single-nucleotide differences: add the implied alleles to the set of implied alleles in the corresponding column of the MSA.

Deletions: add the implied alleles (i.e., gap characters) to the set of implied alleles in the corresponding column of the MSA.

Insertions: it might be that our original personalized haplotype called a deletion where there really is no deletion; and the VCF we analyze might reflect this by specifying an insertion at that position. We want to make sure that we place the characters specified by the insertion in the right columns of the MSA. Thus,

Find the beginning of the insertion and the first character after the insertion, relative to the personalized haplotype, in the MSA.

If there are no gaps between these characters, simply amend the column relating to the beginning of the insertion with the inserted characters.

If there are gaps between these characters, temporarily fill the gap characters with the characters that the canonical reference specifies at these positions; use the Needleman-Wunsch algorithm [4] to align the canonical reference characters to the inserted allele; use this alignment to map each inserted character to a position in the MSA.

In the algorithm described, we have made no attempt to determine whether novel alleles should be integrated into the first or the second personalized reference sequence. We use a heuristic that tries to integrate variants in the haplotype that led to the generation of the variant-implying VCF; in most cases, however, the novel variants are SNPs and there is no (locally phase-determining) information to determine the haplotype of origin.

### 3.3 Current components

The above description of the algorithm does not assume the utilization of particular read mapping or variant calling tool. In our current implementation, we use BWA [5] for mapping and Platypus for variant calling. We also use samtools [6] for some of the intermediate steps.

## 4 Graph alignment and validation

### 4.1 A mapping algorithm for PRGs

The aim in conventional sequence alignment is to identify regions of homology (or more generally, similarity, according to a score function) between two sequences. In the context of the analysis of next-generation sequencing data analysis, sequence alignment (or mapping) between the sequence of a read and the reference genome is used to identify the putative genomic origin of the read.

The alignment problem exists for PRGs, too: identify regions of homology between a given sequence (henceforth called “query sequence”) and a population reference graph. There are two challenges. First, PRGs specify a set of possible paths through the graph, and the query sequence could align to any of these paths. Second, PRGs can comprise “gap” edges – so the optimal alignment would specify “gap” symbols whenever a “gap” edge is traversed, but without penalizing them in the way normal gaps in the alignment are typically penalized.

To address these challenges, we have developed a generalization of the Needleman-Wunsch algorithm for global sequence alignment. We have also developed a set of approximations to reduce the computational complexity of the algorithm, enabling the alignment of query sequences on the order of thousands of bases to PRGs on the order of millions of levels (i.e. regions or genomes Megabases in size).

#### 4.1.1 Statement of the problem

There is a query sequence Q=(q_{1},…,q_{NQ}) consisting of N_{Q} characters from the alphabet {“A”, “C”, “G”, “T”} and a PRG G = (V, E).

Define P_{sub} as the set of possible subpaths (v_{a},v_{b})(v_{b},v_{c})…(v_{m},v_{n}) through the graph (for definition of “subpath”, see Section “kMerification”) and let P_{traversal} contain all elements of P_{sub} for which v_{a} is the start vertex and v_{n} is one of the final vertices. Informally, the set P_{traversal} contains all complete traversals of the graph.

Define a symbol g_{2} which is neither contained in the query alphabet nor exists in E or as a label of one of the edges in E. g_{2} denotes the “gap” symbols introduced during the alignment process, which will be penalized, and, importantly, it is by definition different from the “gap” symbols already contained in G.

We call A = (Q’, E’) an **alignment** of query sequence Q to PRG G of length A_{L}if

A = (Q’, E’) = ((Q’

_{1}, Q’_{2}, ..,Q'_{AL}), (E’_{1}, E’_{2}, .., E'_{AL})), withQ’

_{x}∈ {“A”, “C”, “G”, “T”, g_{2}} ∀x ∈{1 … A_{L}}, andE’

_{x}∈ E ∪{g_{2}} ∀x ∈{1 … A_{L}}Q’ can be formed by inserting an arbitrary number (including 0) of g

_{2}elements into QE’ can be formed by inserting an arbitrary number of g

_{2}elements into an element of P_{traversal}There is no position x in the alignment with Q’

_{x}= E’_{x}= g_{2}.

For a given PRG and a given query sequence Q, define A_{all}(Q, G) as the (finite) set of possible alignments,

The **alignment problem** is to maximize a score function SCORE(*A*) → ℝ on the set of possible alignments A_{all}(Q, G).

#### 4.1.2 Scoring

Alignment scoring functions SCORE(*A*) → ℝ for A ∈ A_{all}(Q, G) can assume arbitrary form (we require that higher scores for a given pair A, G indicate better alignment quality, i.e. we want to maximize SCORE to find a good alignment).

From standard sequence alignment, however, it is well-known [7] that scoring functions of certain forms are amenable to efficient optimization via dynamic programming, in particular those which assign a fixed score to every column in the alignment with identical characters (“matches”), a fixed score to every column in the alignment with non-identical, non-gap characters (“mismatches”), and either a fixed or a linear score to every column containing gap characters. In the linear gap score case (“affine gap penalties”), the first gap in a sequence typically gets a lower score (“gap opening penalty”) than all subsequent gaps in the same sequence (“gap extension penalty”).

We will now define equivalent scoring functions for graph alignment. We define the function label(x) as the label of edge x if x is an edge, and as the empty string “” if x is equal to g_{2}. We need to make special provisions for the case where label(e) for an edge e is the “gap” symbol from the PRG.

##### 4.1.2.1 Non-affine gap penalties

Let a score function SCORE(A) = SCORE(Q’, E’) assume the following form with

We note that we assign score 0 to columns in which the traversed edge carries the “gap” symbol and in which the corresponding sequence column carries the (alignment-induced) gap-symbol g_{2}. Thus, the scoring function will neither reward nor penalize such columns, if SCORE_MATCH > 0 and SCORE_MISMATCH and SCORE_GAP< 0.

##### 4.1.2.2 Ends-free alignment scores

Ends-free alignment as defined here is typically applied if the query sequence is much shorter than the graph; the effects are (1) that the gaps which are necessary to extend Q to at least the graph’s length do not get penalized and (2) to favour “dense” alignments in which there are not many gaps between the original characters of Q.

Formally, in ends-free alignment, SCORE_GAP = 0 for position x if *Q*′* _{x}* =

*g*

_{2}and ∀

*x*

_{2}∈ {1 ..

*x*}:

*Q*′

_{x}_{2}=

*g*

_{2}or if

*Q*′

*=*

_{x}*g*

_{2}and ∀

_{x2}∈ {

*x*..

*A*

_{L}}:

*Q*′

_{x2}=

*g*

_{2}.

##### 4.1.2.3 Affine gap penalties

In affine-gap alignment, the first gap in a continuous sequence of gaps is typically scored differently from the subsequent gaps. “Opening” a gap is typically associated with a higher penalty than “continuing” a gap. Traversing a sequence of “graph gaps” should not end an affine gap in query sequence space.

For x = 0, IS_IN_AFFINE_SEQUENCE_GAP(x) and IS_IN_AFFINE_GRAPH_GAP(x) are defined as *false*.

We summarize the effect of these scoring functions:

For columns with two non-gap characters, we add either SCORE_MATCH or SCORE_MISMATCH.

g

_{2}characters in E’ are always penalized as gaps, and the first such character in a sequence typically more strongly so than the following characters.g

_{2}characters in Q’ are only penalized as gaps if the corresponding character in E’ is not the “gap” character “_”.In this case (i.e. if the g

_{2}character is penalized), it is determined whether this g_{2}character is the first one of an affine sequence gap, and a penalty is assigned accordingly.Affine sequence gaps can span columns in which E’ contains “_” symbols. In terms of the definitions made above, an alignment position

*x*is only part of an affine sequence gap if conditions*C*(_{A}*x*,*x*_{2}),*C*(_{B}*x*,*x*_{2}) and*C*(_{C}*x*,*x*_{2}) are all true for*x*and a*x*_{2}< x:C

_{A}(*x*,*x*_{2}): true if*x*_{2}< xC

_{B}(*x*,*x*_{2}): true if at*x*_{2}a sequence gap is initiated, i.e. there is a gap in the query sequence but no gap in the graph sequence at that position*C*(_{c}*x*,*x*_{2}): true if the gap initiated at*x*_{2}can be extended to*x*, i.e. all graph alignment positions in between carry either a defined graph character or a “_” graph gap symbol, and all sequence alignment positions in between carry a gap symbol.

Note that cases in which both Q’ and E’ carry the g_{2} character at a column are also invalid by definition.

#### 4.1.3 Maximization

The solution to the alignment problem is found by maximizing the supplied scoring function over the set of all possible alignments A_{all}(Q, G).

For the scoring functions described above, the maximization can be carried out by a dynamic programming algorithm, very similar to the Needleman-Wunsch algorithm.

The main difference is that the Needleman-Wunsch algorithm utilizes a two-dimensional matrix of scalars, whereas we utilize a two-dimensional matrix of vectors. For the Needleman-Wunsch algorithm, the value in cell (x,y) of the scoring matrix is defined as the maximum score attainable after having consumed x characters from the reference sequence and y characters from the query sequence. For our algorithm, each cell carries a vector, and we index the values in this vector by a third coordinate. Value (x, y, z) is then defined as the maximum score attainable after having consumed x levels from the graph, y characters from the query sequence, and ending up in vertex z of level x of the graph.

##### 4.1.3.1 Computation for non-affine gap penalties

PRG G has L levels, and query sequence Q has N_{Q} characters.

At each level l of G, let Z_{l} be the number of vertices at this level. Use the integers 1 .. Z_{l} to arbitrarily enumerate the vertices of each level l, and define two index functions that map between vertices and their associated indices and vice versa:

Now define a (N_{Q} + 1) × L matrix denoted M. Each cell (q_{i}, l) contains a vector of length Z_{l}. (We use l to index levels of the graph, and we use q_{i} to index positions within query sequence Q). We use the notation M(q_{i}, l, z) to denote the z-th value of the vector in cell (q_{i}, l).

For the purpose of this section, we use **0-based** indices to index M, the graph and Z_{l}. For example, Z_{0} refers to the number of vertices at the first level of the graph. Individual vertices, however, are indexed using 1-based indices, i.e. the first vertex at a level has the index 1.

We define auxiliary set functions that map a vertex to its potential ancestors and potential predecessors, along the graph.

By definition, the previous and next vertices for v are one level below / one level above v.

We will fill M cell-by-cell. (q_{i}, l, z) shall be the maximum attainable score after having consumed q_{i} characters from Q, l levels from G and ending up in the z-th node at level l of G.

We initialize M:

Origin: For all z∈ (1 .. Z

_{0}), setM(0, 0, z) = 0.

Gaps along the graph coordinate: For all l ∈ (1 … [L – 1]) and z ∈ (1 .. Z

_{l}), setGaps along the query sequence coordinate: For all q

_{i}∈ (1 … N_{Q}) and z ∈ (1 .. Z_{0}), set

We progressively fill M using a nested loop, in the order specified:

For each l ∈ (1 … [L – 1]), for each q

_{i}∈ (1 … N_{Q}), for each z ∈ (1 .. Z_{l}):Define

*v*_{P}= vertex_previous(index2vertex(*l*,*z*)).

(If there are pairs of vertices connected by more than one edge, the maximization in M_{A} and M_{C} needs to be carried out explicitly over edges, instead of nodes at the previous level).

In this recursion, M_{A} is the “gap in query sequence” step, M_{B} is the “gap in graph” step, and M_{C} is the “match or mismatch” step.

The final maximum is max_{z} _{∈{1..ZL−1}} *M*(*N*_{Q}, *L* − 1, *z*), and backtracking, analogous to the classical Needleman-Wunsch, will identify the corresponding alignment. This is easily implemented by having a second matrix, in dimensionality equivalent to M, which stores which coordinates each maximum was drawn from.

##### 4.1.3.2 Computation for affine gap penalties

In classical sequence alignment, affine gap penalties are realized by progressively filling three matrices: one matrix for paths that end in an affine “query sequence” gap, one matrix for paths that end in an affine “reference sequence” gap, and finally one matrix for paths that end arbitrarily, including those that end with a match or mismatch. Jumps between these matrices are allowed where appropriate according to these definitions. For example, the third matrix would always contain the maximum value of all three matrices for any given coordinate. We shall now proceed accordingly, but extend the classical framework to deal with graphs.

We use all definitions from the previous section where appropriate but instead of M, we now define three matrices M_{D}, M_{G} and M_{S}, all of dimensionality (N_{Q} + 1) × L. As for M, each cell in these matrices contains a vector with as many elements as there are nodes at the corresponding level of the graph.

M_{G} (q_{i}, l, z) shall contain the maximum attainable score of all paths ending in an affine graph gap (i.e. a gap *in between levels of the graph*, consuming a character of the query sequence), after having consumed q_{i} characters from Q, l levels from G and ending up in the z-th node at level l of G. M_{S} (q_{i}, l, z) shall contain the equivalent score for paths ending in an affine sequence gap. Finally, M_{D} (q_{i}, l, z) shall contain the equivalent score for paths ending arbitrarily.

We note that we need to make sure that the algorithm implements the provisions we specified in Section “Scoring” for edges carrying the graph “gap” symbol. Importantly, affine sequence gaps can go “through” such edges, but they must not begin by traversing them.

Initialization:

Origin: For all z∈ (1 .. Z

_{0}), setSequence gaps (along the graph coordinate): For all l ∈ (1 … [L – 1]) and z ∈ (1 .. Z

_{l}):Define:

Set:

Graph gaps: For all q

_{i}∈ (1 … N_{Q}) and z ∈ (1 .. Z_{0}):Define:

Set:

We progressively fill the three matrices using a nested loop, in the order specified: For each l ∈ (1 … [L – 1]), for each q_{i} ∈ (1 … N_{Q}), for each z ∈ (1 .. Z_{l}):

**M**_{S}:Define:

Set:

**M**_{G}:Define:

Set:

**M**_{D}:Define:

The final maximum is max_{z} _{∈{1..ZL-1}} *M*_{D}(*N*_{Q}, *L* – 1, *z*), and backtracking will identify the corresponding alignment. For the affine-penalty algorithm, it is necessary to keep track not only of coordinates but also of the movements between the matrices.

#### 4.1.4 Parameterization

In our implementation, we use the following parameterization:

SCORE_MATCH = 2

SCORE_MISMATCH = -5

SCORE_GAP_OPEN = -4

SCORE_GAP_EXTEND = -2

#### 4.1.5 Implementation

The complexity of the described algorithm in in the class O((N_{Q} + 1) × L x max(Z_{l} × Z_{(l-1)})) – i.e. practically inapplicable to problems of the scale we are most interested in: PRGs with millions of levels and query sequences ranging from hundreds to tens of thousands of nucleotides.

We have thus developed a “seed and extend” approximation to the full algorithm, the key components of which we outline here.

Informally, we utilize stretches of sequence we can uniquely localize to constrain the alignment search space. Each such uniquely localized stretch relates to a particular subpath through the full alignment matrix, i.e. a defined combination of matches and gaps connecting one particular point in the alignment matrix with a second particular point in the alignment matrix. In order to complete the alignment, we need to connect these subpaths (a) to each other and (b) to the top-left and bottom-right corners of the matrix, at which point all graph levels and query sequence characters will have been incorporated.

##### 4.1.5.1 Step 1: Chaining

We scan through the query sequence from left to right and identify all exact and contiguous matches between subpaths in the PRG and the query sequence. Each subpath we refer to as a “chain”.

To speed up this process, we kMerify the PRG that we map to, keeping track of the subpath spanned by each kMer. We store kMers and corresponding subpaths in a hash table.

We require that each chain begin with an exact kMer match, and we extend each chain until we hit a mismatch. Of note, each kMer can initiate multiple chains.

##### 4.1.5.2 Step 2: Global chain filtering and fixing

Each chain specifies a path through the main scoring matrices. If the divergence between query sequence and PRG is not too high (which we assume as the PRG / chromotypes we map to contain many population / individual variants), connecting and extending these chains in a sensible manner should yield a good alignment.

We rate each chain by kMer double-uniqueness. That is, for each kMer we determine whether it occurs exactly once in the query sequence and exactly once in the PRG we map to. If both conditions are satisfied, we say that the kMer is double-unique. For each chain we determine the proportion of double-unique kMers, and we rank the chains according to this criterion. We store the ranked chains together with their proportion of uniqueness in a list structure we call `AVAILABLE_CHAINS`.

If the mapping algorithm is run in deterministic mode, we carry out the following steps:

Determine whether there are still chains in

`AVAILABLE_CHAINS`(optionally meeting the criterion that the absolute number of double-unique kMers per chain is above a certain threshold – we currently use 1) – if not, terminate.Select the highest-ranked chain from

`AVAILABLE_CHAINS`and store the corresponding alignment matrix subpath (i.e. the induced sequence of matches and gaps).Remove the selected chain from

`AVAILABLE_CHAINS`and remove all other chains from that set which are incompatible with the selection.Incompatibility can be induced 1) by basic alignment structure as well as 2) by the properties of the PRG we align to.

For the first point, consider fixing a chain which maps the first 10 kMers of the query sequence to levels 10–15 of the graph. It is clear that no following kMer (to the right of the ones already fixed) can be mapped to levels 1–9 of the graph.

For the second point, consider that not all edges in the PRG are necessarily reachable from all other edges, even if they are compatible in terms of levels. PRGs can, for example, contain long haplotypic paths with no connecting edges. If a fixed chain maps to the first of two such paths, no other chain can map to the second.

Go to step 1.

If the algorithm is run in probabilistic mode, step 2 is replaced with a probabilistic selection, the chains weighted by optimality.

##### 4.1.5.3 Step 3: Recursive local chain filtering and fixing

We now deal with the “squares” in between the fixed chains from Step 2 (and the area between the origin of the alignment matrix and the first chain, and between the last chain and the bottom-right corner of the alignment matrix).

Each fixed chain from Step 2 has start- and endpoints, defined in terms of their (q_{i}, l, z) coordinates. Each region between two chains from Step 2 has thus defined start- and endpoints with (q_{i}, l) coordinates assigned, and we refer to these regions as “squares” (the z coordinates matter, too, but we nevertheless we stick with the two-dimensional metaphor).

We now apply the algorithm from Step 2 to each square so-defined, modifying the measure of kMer uniqueness to only take into account uniqueness within the square (along the q_{i} and l coordinates, i.e. we reduce our notion of uniqueness both in terms of levels of the graph and in terms of the query sequence).

This will typically enable us to fix more chains, and we recursively repeat this procedure for each square in between the new sets of chains until we can't fix any more chains.

##### 4.1.5.4 Step 4: Chain extension

We now use the global graph alignment (described earlier in this section) algorithm to try to fill the space in the remaining squares (squares between chains from Step 2 and Step 3).

More formally, for each remaining square, we start running the global alignment algorithm at the start coordinates and (in reverse direction) at the end coordinates. We use affine gap penalties with the parameterization described earlier. We terminate the algorithm if either the square boundaries have been reached or if the total score has fallen below a certain threshold (-11). The algorithm also differs from the one described earlier in a couple of other points, most of them aimed at eliminating unpromising areas of the search space:

The scoring matrices are filled in diagonals. Each diagonal counts as one iteration, and the termination threshold refers to the maximum value achieved during the computation of one diagonal.

We also measure how many iterations (i.e., filled diagonals) ago the achieved maximum value over the complete search space was last increased, and if this number crosses a threshold (40), we terminate.

We measure the maximum value achieved in one diagonal, and we prune all cells in the diagonal if the difference between the maximum and the cell value (in M

_{D}) is bigger than a threshold (15). Pruned cells are not considered as sources for the recursion equations when computing the next diagonal.If the total maximum over the scanned area is achieved in multiple cells, we store all corresponding coordinates. If the reverse run started from the end coordinates hits one of the maximum points of the forward run started from the start coordinates, we also store the coordinates and the achieved score.

When the forward and backward extension runs have terminated, we have obtained

a maximum achieved score for each run and where this was achieved

(potentially) where the reverse run hit the maxima of the forward run, and the associated scores.

For notational convenience, let R_{F} denote the set of coordinates from (1) and (2) from the forward run, and let R_{B} denote the set of coordinates from (1) and (2) from the backward run. If the total maximum (1), however, is < 0 for a run, we will only include the start coordinates of the run in the corresponding set.

We now consider the cross product R_{F} × R_{B}. For each combination of ends points {(*f*_{qi}, *f*_{l}, *f*_{z}), (*b*_{qi}, *b*_{l}, *b*_{z})} ∈ (R_{F} × R_{B}), we

examine whether the combination of forward- and backward-derived coordinates is compatible, i.e. whether it is possible to connect the forward end point (

*f*_{qi},*f*_{l},*f*_{z}) to the backward end point (*b*_{qi},*b*_{l},*b*_{z}) via a PRG-consistent alignment path in positive direction along the l and q_{i}coordinates.(This path is typically a long sequence gap followed by a graph gap, or vice versa, unless

*f*_{qi}=*b*_{qi}or (*f*_{l},*f*_{z}) = (*b*_{l},*b*_{z})).If there is no such path, we discard this combination.

If there is such a path, however, we compute the score for

the path from the forward extension start point (i.e. the top-left corner of the square) to the particular forward extension end point (

*f*_{qi},*f*_{l},*f*)_{z}the path from (

*f*_{qi},*f*_{l},*f*), to (_{z}*b*_{qi},*b*_{l},*b*)_{z}the path from (

*b*_{qi},*b*_{l},*b*) to the backward extension start point (i.e. the bottom-right corner of the square)._{z}

The three paths combined connect the start coordinate of the square to the end coordinate of the square, and the three scores combined determine the optimality of that particular way of connecting the coordinates. In summing up the scores, we need to keep track of where affine gaps open and close.

We finally select the combination that achieved the maximum score and use the corresponding path to connect the start coordinate of the square to the end coordinate of the square. (If multiple combinations achieve the same score, we make a random selection; if no combinations are compatible, we connect the start and the end coordinate of the square with gaps).

##### 4.1.5.5 Step 5: Backtracking

We choose max_{z ∈{1..ZL−1}} *M*_{D}(*N*_{Q}, *L* – 1, *z*) as the final score of the algorithm, and we backtrack accordingly.

##### 4.1.5.6 Further points

To speed up the algorithm, in particular for short or medium-sized query sequences, it can be helpful to omit the computation of the optimal gap paths between the top-left corner of the alignment matrix and the first chain and the last chain and the bottom-right corner of the alignment matrix.

To see why this makes sense, consider aligning a fragment of 1000 bases against a PRG of 5m bases: In the resulting global alignment, approximately 4.999m positions would be used to specify the gaps before and after the query sequence, and computing these gaps would typically be much more resource-intensive than the parts relating to the query sequence.

(Graph alignment requires traversal of the graph, even if a sequence gap is to be inserted at a particular position – for the final score is influenced by whether the sequence gaps sit below edges labeled with the “gap” symbol or not).

The algorithm described in Step 2 (and hence that for Step 3) can, as specified, be run in probabilistic mode. We typically carry out one “deterministic” run (always selecting the chain with maximum double-uniqueness) and a number of “probabilistic” iterations (randomly selecting from a uniqueness-weighted selection of chains).

As final result we select the iteration that achieved the highest score, and compare across iterations to compute measures of confidence.

For example, for each character in the query sequence, we count in how many iterations it ends up being assigned to the same level of the PRG as in the chosen maximum iteration. This (to an extent) quantifies the uncertainty in placement of query characters.

### 4.2 Validation

#### 4.2.1 Chromotype disentanglement for kMer recovery validation

For any two (or more) positions at which a chromotype loses phase (i.e. at nodes with two outgoing edges) within distance k (where k is the chosen kMer length for validating the chromotype against some sequencing data), we need to disentangle the chromotype prior to validation.

After disentanglement, a chromotype induces a set of well-defined kMers, which we expect to find in sample sequencing data if the chromotype is a correct representation of sequence present in the sample.

We employ a simple greedy algorithm for disentanglement. As our final criterion for assessing a chromotype is how many of the kMers can be recovered from sample data, our disentanglement procedure (locally) optimizes for this criterion.

To prepare for disentanglement, we compartmentalize our chromotype so that all stretches between the phase breakpoints become one compartment (i.e. the compartments are separated by nodes with more than one outgoing edge). Each compartment can be either homozygous or heterozygous, depending on whether the stretch it spans has one or two nodes at each level.

Now we move through the chromotype from the left to the right, carrying with us a set of unresolved haplotype pairs. If the first compartment of the chromotype is homozygous with sequence s_{1}, we initialize the set HAPSET of pairs of haplotypes as {(s_{1}, s_{1})}, or as {(s_{1}, s_{2})} if it is diploid with sequences s_{1} and s_{2}.

Moving from compartment i to compartment i + 1, we carry out the following procedure:

If compartment i + 1 is homozygous, append sequence s

_{1}of compartment i + 1 to all haplotype pairs in HAPSET.If s

_{1}contained k or more non-gap characters, call REDUCE.Set i = i + 1.

If compartment i + 1 is heterozygous, append sequences s

_{1}and s_{2}to all members of HAPSET, once matching s_{1}with the first member of each haplotype pair and once with the second. That is, HAPSET doubles in size.If HAPSET has more than 100,000 members, call RESOLVE.

Set i = i + 1.

Before defining REDUCE and RESOLVE, we define OPTIMALITY((s_{1}, s_{2})). OPTIMALITY computes the optimality of a haplotype pair (s_{1}, s_{2}) by

Removing all gaps from (s

_{1}, s_{2}).Dividing the number of kMers in (s

_{1}, s_{2}) found in the sample data by the total number of kMers in (s_{1}, s_{2}).If there are 0 kMers in (s

_{1}, s_{2}) (e.g. because both sequences have length smaller than k), OPTIMALITY is defined as 0.

RESOLVE orders HAPSET according to the values returned from OPTIMALITY and eliminates all members of HAPSET but the one with the best score.

REDUCE orders HAPSET according to the values returned from OPTIMALITY and retains the 1000 best-scoring haplotype pairs.

After completing this algorithm for the last compartment (we need to call RESOLVE if HAPSET has more than one member), the chromotype is disentangled into two strings (the members of the only remaining element of HAPSET), which specify an unambiguous set of kMers. We call the chromotype equivalent to these two strings (i.e. one start vertex and two separate, non-connected branches encoding the two strings) the disentangled chromotype.

#### 4.2.2 Identification of xMHC-specific contigs

For the Moleculo-based validation, we want to identify contigs that have originated from NA12878’s xMHC region. We thus filter and trim the raw contig sequence data prior to alignment, according to the following criteria:

We compute the set of all kMers (k = 31) occurring in the kMerified xMHC PRG. We call all kMers occurring in this set “xMHC kMers”.

We also compute the set of all kMers (k = 31) occurring in the human reference genome, excluding the region covered by the xMHC PRG. We call all kMers in this set “reference kMers”. Note that some kMers are both xMHC kMers and reference kMers. We call kMers which are xMHC kMers but not reference kMers “xMHC-unique kMers”.

We filter contigs according to the following criteria:

Fraction xMHC kMers >= 0.8

There are two xMHC-unique kMers spanning a stretch of at least 50 bases (in between the two kMers). For each contig, we select the maximum stretch MAXSTRETCH spanned by two such xMHC-unique kMers.

Within MAXSTRETCH, fraction of xMHC-unique kMers >= 0.5

Within MAXSTRETCH, fraction of reference kMers <= 0.3

If a contig passes these tests, we truncate the contig to MAXSTRETCH and align MAXSTRETCH.

### 4.3 Important symbols and abbreviations

- A
- Alignment A
- Q=(q
_{1},…,q_{NQ}) - Query sequence of N
_{Q}characters - A
_{all}(Q, G) - Set of alignments between Q and G.
- PRG
- Population Reference Graph
- COV
- Catalogue of Variation

- G
- The specific PRG
- V
- Set of vertices
- E
- Set of edges
- P
_{n}(e) - Edge probability distribution at node n
- P
_{sub} - Set of all subpaths
- P
_{traversal} - Set of all subpaths, constrained to complete traversals
- V
_{m} - Two vertices
- v
_{n} - e
- One edge
- l(v)
- The level of vertex v

- L
- Scaffold haplotype MSA length; last level of haplotype graph
- H(v)
- The set of scaffold haplotypes attached to v
- K(v)
- The set of kMer-edges attached to v
- c
_{v} - Current vertex

- r
- “Recombination” parameter
- S
_{N} - Number of scaffold haplotypes
- S
_{n,i} - i-th position (MSA) of haplotype n
- O
_{i} - Set of kMers output from level i
- o(kMer)
- Sample count of kMer
*kMer* - x
- Generic variable
- X
- Additional variant specifiers.

*suffix*(v, r)- Suffix function for vertex v of length r

- Q
- Alignment query sequence
- N
_{Q} - Length of Q
- q
_{i} - Index for Q
- Q’
- Aligned query sequence
- E’
- Aligned edge sequence
- A
_{L} - Alignment length
- M
- Alignment scoring matrix
- Z
_{l} - Number of nodes at level
- node(l, z)
- Retrieve node z at level l.

## 5 Experimental details

### 5.1 Simulation

We carry out simulations by independently generating two paths through the PRG (uniform choices at junctions) and treating these as a sample’s underlying diploid genome. We concatenate the edge labels induced by each path, remove “gap” characters and use the strings so-generated as a sample’s two haplotypes to generate reads from. At each position, the number of starting reads (read length 85 bp) is given by a Poisson distribution, parameterized to match an expected total coverage of 30x. Accuracy is assessed by comparing the diploid true underlying genotype at each level of the PRG with the diploid genotype induced by the Viterbi genotype computed from the simulated reads. Our simulations are limited in that a) we ignore read error (the main effect of which is a slight reduction of coverage) and b) we treat the simulated paths as a sample’s complete genome.

### 5.2 Graph for the extended MHC

We define the extended MHC (xMHC) as the genomic region spanned by the “PGF” xMHC haplotype (identical to the canonical human reference in the region – in B37 coordinates: chr6:28,702,185-33,451,429).

We use the eight xMHC haplotypes from the MHC haplotype project [8] as scaffold haplotypes for the region. We create an MSA for the eight haplotypes using the programs FSA [9] and MAFFT for refinement [10] . We use the SNPs identified by the 1000 Genomes Project, Phase 1, Release 3 [11] as additional variant specifiers for the eight MHC haplotypes.

We also use all available genomic HLA allele sequences from IMGT/HLA ([12], http://www.ebi.ac.uk/ipd/imgt/hla/, downloaded on 29/02/2012) for the classical HLA alleles at the loci *HLA-A*, *-B*, *-C*, *-DQA1*, *-DQB1*, *-DRB1* as additional scaffold haplotypes (these scaffold haplotypes cover all exons and introns of the genes – for many known alleles, the genetic sequences are not completely specified over all exons and introns, but the PRG construction algorithm we have defined removes most of the wildcard characters found at the unspecified positions). We do not specify any additional variant specifiers for the classical HLA genes.

The edge probability distributions we specify at each vertex in the PRG are mostly improper. Specifically, we assign probability 1 to each edge. This is motivated by the downstream parts of our pipeline: we mostly rely on the Viterbi algorithm for inferring Maximum Likelihood personalized haplotypes. With the specified improper parameterization, each possible path through the model is equally likely under the Viterbi algorithm, independent of how many potential branching points (vertices where there is more than one possible edge to follow) it contains.

We use kMer length k = 31 for creating the kMer-PRG.

#### 5.2.1 Ensembl inconsistency

In the process of examining available annotation information for the eight xMHC haplotypes, we discovered an inconsistency in the Ensembl database [13]. On the SSTO haplotype, *HLA-DRB1* and *HLA-DRB4* were mapped to the same start coordinate, likely caused, according to Ensembl, by a mismapping of exonic sequence of the two transcripts ENST00000549627 and ENST00000548105 (*HLA-DRB4* and *HLA-DRB1* exon sequence is similar). The two transcripts will be deleted in release 72 or release 73 (*personal communication*).

### 5.3 Sample details and HLA types

CS1 and CS2-6 samples:

Next-generation sequencing data generated as described in the main text. Sample read data available at request from GlaxoSmithKline.

NA12878:

Next-generation sequencing for NA12878 from the Illumina Platinum genomes project (www.illumina.com/platinumgenomes/) was downloaded from the EBI (www.ebi.ac.uk/ena/data/view/ERP001775). Sample data details described in the main text.

Sample HLA types (reported to 4-digit accuracy using ‘g’ nomenclature):

## Acknowledgements

Funded by grants from GSK and 100956/Z/13/Z from the Wellcome Trust to GM, a Nuffield Department of Medicine Fellowship to ZI, and a Sir Henry Dale Fellowship jointly awarded by the Wellcome Trust and the Royal Society to ZI (102541/Z/13/Z). We thank Mike Eberle and colleagues at Illumina for early access to the Moleculo data.

## References

## 6 References

- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.