Nucleotide-resolution bacterial pan-genomics with reference graphs

Background: Bacterial genomes follow a U-shaped frequency distribution whereby most genomic loci are either rare (accessory) or common (core); the union of these is the pan-genome. The alignable fraction of two genomes from a single species can be low (e.g. 50-70%), such that no single reference genome can access all single nucleotide polymorphisms (SNPs). The pragmatic solution is to choose a close reference and analyse SNPs only in the core genome. Given that much bacterial adaptability hinges on the accessory genome, this is an unsatisfactory limitation.

Results: We present a novel pan-genome graph structure and algorithms implemented in the software pandora, which approximates a sequenced genome as a recombinant of reference genomes, detects novel variation and then pan-genotypes multiple samples. The method takes fastq as input and outputs a multi-sample VCF with respect to an inferred, data-dependent reference genome, and is available at https://github.com/rmcolq/pandora. Constructing a reference graph from 578 E. coli genomes, we analyse a diverse set of 20 E. coli isolates. We show that pandora recovers at least 13k more rare SNPs than single-reference-based tools, achieves equal or better error rates with Nanopore than with Illumina data, attains 6-24x lower Nanopore error rates than other tools, and provides a stable framework for analysing diverse samples without reference bias. We also show that our inferred recombinant VCF reference genome is significantly better than simply picking the closest RefSeq reference.

Conclusions: This is a step towards comprehensive cohort analysis of bacterial pan-genomic variation, with potential impacts on genotype/phenotype and epidemiological studies.

We set out to define a generalised reference structure which allows detection of SNPs and other variants across the whole pan-genome, without attempting to record long-range structure or coordinates. We define a Pan-genome Reference Graph (PanRG) as an unordered collection of sequence graphs, termed local graphs, each of which represents a locus, such as a gene or intergenic region. Each local graph is constructed from an MSA of known alleles of this locus, using a recursive cluster-and-collapse (RCC) algorithm (Supplementary Animation 1: recursive clustering construction). The output is guaranteed to be a directed acyclic sequence graph allowing hierarchical nesting of genetic variation while meeting a "balanced parentheses" criterion (see Figure 2b and Methods). Each path through the graph from source to sink represents a possible recombinant sequence for the locus. The disjoint nature of this pan-genome reference allows loci such as genes to be compared regardless of their wider genomic context. We implement this construction algorithm in the make_prg tool, which outputs the graph as a file (see Figures 2a-c, Methods). Subsequent operations, based on this, are implemented in the software package pandora. The overall workflow is shown in Figure 2.

To index a PanRG, we generalise a type of sparse marker k-mer (the (w,k)-minimizer), previously defined for strings, to directed acyclic graphs (see Methods). Each local graph is sketched with minimizing k-mers, and these are then used to construct a new graph (the k-mer graph) for each local graph from the PanRG. Each minimizing k-mer is a node, and edges are added between two nodes if they are adjacent minimizers on a path through the original local graph.
This k-mer graph is isomorphic to the original if w ≤ k (and outside the first and last w+k-1 bases); all subsequent operations are performed on this graph, which, to avoid unnecessary new terminology, we also call the local graph.

A global index maps each minimizing k-mer to a list of all local graphs containing that k-mer and the positions therein. Long or short reads are approximately mapped (quasi-mapped) to the PanRG by determining the minimizing k-mers in each read. Any of these read quasi-mappings found in a local graph are called hits, and any local graph with sufficient clustered hits on a read is considered present in the sample.
Figure 2. The pandora workflow. a) reference panel of genomes; colour signifies locus (gene or intergenic region) identifier, and blobs are SNPs. b) multiple sequence alignments (MSAs) for each locus are made and converted into a directed acyclic graph. c) local graphs constructed from the loci in the reference panel. d) Workflow: the collection of local graphs, termed the PanRG, is indexed. Reads from each sample under study are independently quasi-mapped to the graph, and a determination is made as to which loci are present in each sample. In this process, for each locus, a mosaic approximation of the sequence for that sample is inferred, and variants are genotyped. e) regions of low coverage are detected, and local de novo assembly is used to generate candidate novel alleles missing from the graph. Returning to d), the dotted line shows that all the candidate alleles from all samples are then gathered and added to the MSAs at the start, and the PanRG is updated. Then, reads are quasi-mapped one more time, to the augmented PanRG, generating new mosaic approximations for all samples and storing coverages across the graphs; no de novo assembly is done this time. Finally, all samples are compared, and a VCF file is produced, with a per-locus reference that is inferred by pandora.

Initial sequence approximation as a mosaic of references

For each locus identified as present in a sample, we initially approximate the sample's sequence as a path through the local graph. The result is a mosaic of sequences from the reference panel. This path is chosen to have maximal support by reads, using a dynamic programming algorithm on the graph induced by its (w,k)-minimizers (details in Methods). The result of this process serves as our initial approximation to the genome under analysis.
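The path-finding step can be sketched as follows. This is a deliberately simplified illustration: it finds a maximum-support source-to-sink path in a topologically ordered DAG by dynamic programming, whereas pandora itself maximises a likelihood under a negative binomial coverage model (see Methods). All names below are illustrative.

```python
from collections import defaultdict

def best_mosaic_path(nodes, edges, support):
    """Max-support source-to-sink path in a DAG of minimizing k-mers.

    nodes: list in topological order; edges: dict node -> successors;
    support: dict node -> read support (e.g. k-mer coverage).
    Additive support is a simplification of pandora's likelihood model.
    """
    best = {}   # node -> best score of a path ending at that node
    back = {}   # node -> predecessor on that best path
    preds = defaultdict(list)
    for u in nodes:
        for v in edges.get(u, []):
            preds[v].append(u)
    for v in nodes:  # topological order guarantees preds are finished
        incoming = [(best[u], u) for u in preds[v]]
        prev_score, prev = max(incoming) if incoming else (0, None)
        best[v] = prev_score + support[v]
        back[v] = prev
    # trace back from the best-scoring sink
    sinks = [v for v in nodes if not edges.get(v)]
    end = max(sinks, key=lambda v: best[v])
    path = []
    while end is not None:
        path.append(end)
        end = back[end]
    return path[::-1]
```

Because the local graph is acyclic, a single pass in topological order suffices; the same traversal is reused later with different node weights when choosing the VCF reference path.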
Improved sequence approximation: modify mosaic by local assembly

At this point, we have quasi-mapped reads and approximated the genome by finding the closest mosaic in the graph; however, we expect the genome under study to contain variants that are not present in the PanRG. Therefore, to allow discovery of novel SNPs and small indels that are not in the graph, for each sample and locus we identify regions of the inferred mosaic sequence where there is a drop in read coverage (as shown in Figure 2e). Slices of overlapping reads are extracted, and a form of de novo assembly is performed using a de Bruijn graph. Instead of trying to find a single correct path, the de Bruijn graph is traversed (see Methods for details) to yield all feasible candidate novel alleles for the sample. These alleles are added to the reference MSA for the locus, and the local graph is updated. If comparing multiple samples, the graphs are augmented with all new alleles from all samples at the same time.
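The enumeration of candidate alleles can be sketched as a de Bruijn graph built from read slices, traversed exhaustively between two anchor k-mers rather than along a single best path. The anchors, k value and path cap below are illustrative assumptions, not pandora's actual parameters.

```python
from collections import defaultdict

def candidate_alleles(reads, left_anchor, right_anchor, k=5, max_paths=100):
    """Enumerate candidate novel alleles between two anchor k-mers.

    Builds a de Bruijn graph from read slices, then returns every simple
    path from left_anchor to right_anchor, spelled out as a sequence.
    """
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            succ[read[i:i + k]].add(read[i + 1:i + k + 1])

    paths, stack = [], [(left_anchor, left_anchor)]
    while stack and len(paths) < max_paths:
        node, seq = stack.pop()
        if node == right_anchor:
            paths.append(seq)
            continue
        for nxt in succ[node]:
            if nxt not in seq:  # simple paths only, avoids cycles
                stack.append((nxt, seq + nxt[-1]))
    return paths
```

Capping the number of enumerated paths guards against combinatorial blowup in repetitive or noisy regions, at the cost of possibly missing some candidates.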
Optimal VCF-reference construction for multi-genome comparison

In the compare step of pandora (see Figure 2d), we enable continuity of downstream analysis by outputting genotype information in the conventional VCF format (21). In this format, each row (record) describes possible alternative allele sequence(s) at a position in a (single) reference genome, together with information about the type of sequence variant. A column for each sample details the allele seen in that sample, often along with details about the support from the data for each allele.
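A minimal sketch of what such a record looks like in pandora's setting, where the CHROM column holds a locus (local graph) identifier rather than a chromosome name. The locus name and genotypes here are invented for illustration.

```python
def vcf_record(chrom, pos, ref, alt, genotypes):
    """Compose one multi-sample VCF data line (tab-separated).

    chrom is the local-graph (locus) identifier used in place of a
    chromosome name; pos is 1-based on the inferred per-locus
    reference path. Values below are illustrative, not pandora output.
    """
    fixed = [chrom, str(pos), ".", ref, alt, ".", "PASS", ".", "GT"]
    return "\t".join(fixed + [str(g) for g in genotypes])

# A SNP in a hypothetical gene cluster, genotyped in three samples:
line = vcf_record("GC00001234", 17, "A", "G", [0, 1, 1])
```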
Figure 3. The representation problem. a) a local graph. b) The black allele is chosen as the reference to enable representation in VCF. The blue/red SNP then requires flanking sequence in order to be given a coordinate. The SNP is thus represented as two ALT alleles, each 3 bases long, and the user is forced to notice that they differ in only one base. c) The blue path is chosen as the reference, enabling a more succinct and natural representation of the SNP.

To output graph variation, we first select a path through the graph to be the reference sequence and describe any variation within the graph with respect to this path, as shown in Figure 3. We use the chromosome field to record the local graph within the PanRG in which a variant lies, and the position field to give the position in the chosen reference path sequence for that graph. In addition, we output the reference path sequences used as a separate file.
For a collection of samples, we want small differences between samples to be recorded as short alleles in the VCF file rather than as longer alleles with shared flanking sequence, as shown in Figure 3b. We therefore choose the reference path for each local graph to be maximally close to the sample mosaic paths. To do this, we make a copy of the k-mer graph and increment the coverage along each sample mosaic path, producing a graph with higher weights on paths shared by more samples. We reuse the mosaic path-finding algorithm (see Methods) with a modified probability function defined such that the probability of a node is proportional to the number of samples covering it. This produces a dataset-dependent VCF reference able to succinctly describe segregating variation in the cohort of genomes under analysis.

Constructing a PanRG of E. coli

We chose to evaluate pandora on a recombining bacterial species, E. coli, whose pan-genome has been heavily studied (7,23-26). MSAs for gene clusters curated with PanX (27) from 350 RefSeq assemblies were downloaded from http://pangenome.de on 3rd May 2018. MSAs for intergenic region clusters based on 228 E. coli ST131 genome sequences had previously been generated with Piggy (28) for its publication. Whilst this panel of intergenic sequences does not reflect the full diversity within E. coli, we included them as an initial starting point. This resulted in an E. coli PanRG containing local graphs for 23,054 genes and 14,374 intergenic regions. Pandora took 24.4h of CPU time (2.3h of runtime with 16 threads) and 12.6 GB of RAM to index the PanRG. As one would expect from the U-shaped gene frequency distribution, many of the genes were rare in the 578 (=350+228) input genomes, and so 59%/44% of the genic/intergenic graphs were linear, with just a single allele.
Constructing an evaluation set of diverse genomes

We first demonstrate that using a PanRG reduces hard bias when comparing a diverse set of 20 E. coli samples, by comparison with standard single-reference variant callers. We selected samples from across the phylogeny (including phylogroups A, B2, D and F (29)) where we were able to obtain both long and short read sequence data from the same isolate.

For these input data, we show results for pandora's first approximation to a genome as a mosaic (recombinant) of the input reference panel (mosaic, light blue), and then the improved approximation with added de novo discovery (mosaic+de novo, dark blue). The top right of each curve corresponds to completely unfiltered results; increasing the genotype confidence threshold (see Methods) moves each curve towards the bottom left, increasing precision at the cost of recall. Notably, with normal basecalling, local de novo assembly increases the error rate from 0.53% to 0.67%, with a negligible increase in recall, from 88.7% to 89.3%, whereas with methylation-aware basecalling it increases the recall from 89.1% to 90% and slightly decreases the error rate from 0.49% to 0.48%. On this basis, from here on we work entirely with reads basecalled with a methylation-aware model, and move to the full dataset of 20 samples.

Benchmarking recall, error rate and dependence on reference

We show in Figures 6a,b the Illumina and Nanopore AvgAR/recall plots for pandora and four single-reference tools with no filters applied. For all of these, we modify only the minimum genotype confidence to move up and down the curves (see Methods).
are hard to interpret because pandora and the reference-based tools have recall that varies differently across the locus frequency spectrum; we explore this further below.

We ascribe the similarity between the Nanopore and Illumina performance of pandora to three reasons. First, the PanRG is a strong prior: our first approximation does not contain any Nanopore sequence, but simply uses quasi-mapped reads to find the nearest mosaic in the graph. Second, mapping long Nanopore reads which completely cover entire genes is easier than mapping Illumina data, and allows us to filter out erroneous k-mers within reads after deciding that a gene is present. Third, this performance is only achieved when we use methylation-aware basecalling of Nanopore reads, presumably removing most systematic bias (see Figure 5).

In Figure 6c,d we show, for Illumina and Nanopore data, the impact of reference choice on the precision of calls on each of the 20 samples. While precision is consistent across all samples for pandora, we see a dramatic effect of reference choice on the precision of SAMtools, medaka and nanopolish. The effect is also detectable for snippy, but to a much lesser extent.

Finally, we measured the performance of locus presence detection, restricting to genes/intergenic regions in the PanRG, so that in principle perfect recall would be possible (see Methods). In Supplementary Figure 3 we show the distribution of locus presence calls by pandora, split by length of locus, for Illumina and Nanopore data. Overall, 93.8%/94.3% of loci were correctly classified as present or absent for Illumina/Nanopore respectively. Misclassifications were concentrated on small loci (below 500bp): while 59.2%/57.4% of all loci in the PanRG are small, 75.5%/74.8% of false positive calls and 98.7%/98.1% of false negative calls are small loci (see Supplementary Figure 3).
Pandora detects rare variation inaccessible to single-reference methods

Next, we evaluate the key deliverable of pandora: the ability to access genetic variation within the accessory genome. We plot this in Figure 7, showing the pan-variant recall (PVR) of SNPs in the truth set which overlap genes or intergenic regions from the PanRG, broken down by the number of samples the locus is present in.

If we restrict our attention to rare variants (present only in 2-5 genomes), we find pandora recovers at least 19,644/26,674/13,108/22,331 more SNPs than SAMtools/snippy/medaka/nanopolish respectively. As a proportion of rare SNPs in the truth set, this is a lift in PVR of 12/17/8/14% respectively. If, instead of pan-variant recall, we look at the variation of AvgAR across the locus frequency spectrum (see Supplementary Figure 4), the gap between pandora and the other tools on rare loci is even larger. These observations, and Figure 6, confirm and quantify the extent to which we are able to recover accessory genetic variation that is inaccessible to single-reference based methods.

Pandora has consistent results across E. coli phylogroups

We measure the impact of reference bias (and population structure) by quantifying how recall varies in phylogroups A, B2, D, and F depending on whether the reference genome comes from the same phylogroup.
Figure 9. Sharing of variants present in precisely 2 genomes, showing which pairs of genomes they lie in and which phylogroups; darker colours signify higher counts (log scale). Genomes are coloured by their phylogroup (see Figure 4 inset).

These results will, however, be dominated by the shared, core genome. If we replot Figure 8a, restricting to variants in loci present in precisely 2 genomes (abbreviated to 2-variants; Figure 8b), we find that pandora achieves 50-84% recall for each sample (complete data in Supplementary Figure 9). By contrast, for any choice of reference genome, the results for single-reference callers vary dramatically per sample. Most samples have recall under 25%, and there is no pattern of improved recall for samples in the same phylogroup as the reference. Following up that last observation, if we look at which pairs of genomes share 2-variants (Figure 9), we find there is no enrichment within phylogroups at all. This simply confirms, in our data, that presence of rare loci is not correlated with the overall phylogeny.

Pandora VCF reference is closer to samples than any single reference

The relationship between phylogenetic distance and gene repertoire similarity is not linear. In fact, 2 genomes in different phylogroups may have more similar accessory genes than 2 in the same phylogroup, as illustrated in the previous section (also see Figure 3 in Rocha (3)). As a result, it is unclear a priori how to choose a good reference genome for comparison of accessory loci between samples. Pandora specifically aims to construct an appropriate reference for maximum clarity in VCF representation. We evaluate how well pandora is able to find a VCF reference close to the samples under study as follows. We first identified the location of all loci in all 20 sample assemblies and the 24 references (see Methods).
Figure 10. How often do references closely approximate a sample? pandora aims to infer a reference for use in its VCF which is as close as possible to all samples. We evaluate its success here. The x-axis shows the number of genomes in which a locus occurs. The y-axis shows the (log-scaled) count of loci in the 20 samples that are within 1% edit distance (scaled by locus length) of each reference: box plots for the reference genomes, and a line plot for the VCF reference inferred by pandora.

We then measured the edit distance between each locus in each of the references and the corresponding version in the 20 samples. We found that pandora's VCF reference lies within 1% edit distance (scaled by locus length) of the sample far more often than any of the references for loci present in <=14 samples (Figure 10; note the log scale). The improvement is much reduced in the core genome; essentially, in the core, a phylogenetically close reference provides a good approximation, but it is hard to choose a single reference that provides a close approximation to all rare loci. By contrast, pandora is able to leverage its reference panel, and the dataset under study, to find a good approximation.

Computational performance

Performance measurements for single-sample analysis by pandora and the benchmarked tools are shown in Supplementary Table 3. In short, pandora took 3-4 hours per sample (using 16 cores and up to 10.7 GB of RAM), which was slower than snippy (0.1h, 4 cores), SAMtools (0.3h, 1 core) and medaka (0.3h, 4 cores), but faster than nanopolish (4.6h, 16 cores). Pandora alone can do joint analysis of multiple samples, and this is currently the most expensive pandora step. Parallelising by gene on a compute cluster, it took 8 hours to augment the PanRG with novel alleles. This was dominated by the Python implementation of the RCC clustering algorithm (see Methods) and the use of Clustal Omega (35) for MSA.
90% of loci required less than 30 minutes to process, and the remainder took less than 2 hours (see Methods). We discuss below how this could be improved. Finally, it took 28/46 hours to compare the samples (i.e. produce the joint VCF file) for Illumina/Nanopore. Mapping comprised ~10% of the Illumina time and ~50% of the Nanopore time; dynamic programming and genotyping the VCF file took ~90% of the Illumina time and ~50% of the Nanopore time.
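The per-locus closeness criterion used in the Figure 10 evaluation can be sketched as a scaled edit distance; here with a plain Levenshtein dynamic program (the actual evaluation pipeline may use different alignment tooling):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(|a|*|b|) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def within_one_percent(ref_locus: str, sample_locus: str) -> bool:
    """Is the reference allele within 1% edit distance, scaled by locus length?"""
    return edit_distance(ref_locus, sample_locus) <= 0.01 * len(sample_locus)
```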

Discussion
Bacteria are the most diverse and abundant cellular life form (36). Some species are exquisitely tuned to a particular niche (e.g. obligate pathogens of a single host) while others are able to live in a wide range of environments (e.g. E. coli can live on plants, in the earth, or commensally in the gut of various hosts). Broadly speaking, a wider range of environments correlates with a larger pan-genome, and some parts of the gene repertoire are associated with specific niches (37). Our perception of a pan-genome therefore depends on our sampling of the unknown underlying population structure, and similarly the effectiveness of a PanRG will depend on the choice of reference panel from which it is built. Many examples from different species have shown that bacteria are able to leverage this genomic flexibility, adapting to circumstance sometimes by using or losing novel genes acquired horizontally, and at other times by mutation. There are many situations where precise nucleotide-level variants matter in interpreting pan-genomes. Some examples include: compensatory mutations in the chromosome reducing the fitness burden of new plasmids (38-40); lineage-specific accessory genes with SNP mutations which distinguish carriage from infection (41); SNPs within accessory drug resistance genes leading to significant differences in antibiograms (42); and changes in CRISPR spacer arrays showing immediate response to infection (43,44). However, up until now there has been no automated way of studying non-core gene SNPs at all, still less a way of integrating them with gene presence/absence information. Pandora solves these problems, allowing detection and genotyping of core and accessory variants. It also addresses the problem of what reference to use as a coordinate system, inferring a mosaic "VCF reference" which is as close as possible to all samples under study.
We find this gives more consistent SNP-calling than any single reference in our diverse dataset. We focussed primarily on Nanopore data when designing pandora, and show it is possible to achieve higher quality SNP calling with this data than with current Nanopore tools. Together, these results open the door for empirical studies of the accessory genome, and for new population genetic models of the pan-genome from the perspective of both SNPs and gene gain/loss.

Prior graph genome work, focussing on soft reference bias (in humans), has evaluated different approaches for selecting alleles for addition to a population graph, based on frequency, avoiding creating new repeats, and avoiding exponential blowup of haplotypes in clusters of variants (45). This approach makes sense when you have unphased diploid VCF files and consider all recombinants of clustered SNPs to be possible; effectively, it assumes the recombination rate is high enough that all recombinants occur. Our approach, building from local MSAs and only collapsing haplotypes when they agree for a fixed number of bases, preserves more haplotype structure and avoids combinatorial explosion. Another alternative approach was recently taken by Norri et al. (46), inferring a set of pseudo-founder genomes from which to build the graph.
Another issue is how to select the reference panel of genomes in order to minimize hard reference bias. One cannot escape the U-shaped frequency distribution; whatever reference panel is chosen, future genomes under study will contain rare genes not present in the PanRG. Given the known strong population structure in bacteria, and the association of accessory repertoires with lifestyle and environment, we would advocate sampling by geography, host species (if appropriate), lifestyle (e.g. pathogenic versus commensal) and/or environment. In this study we built our PanRG from a biased dataset (RefSeq) which does not attempt to achieve balance across phylogeny or ecology, limiting our pan-variant recall to 49% for rare variants (see Figure 7c,d). A larger, carefully curated input panel, such as that from Horesh et al. (47), would provide a better foundation and potentially improve results.

A natural question is then to ask whether the PanRG should continually grow, absorbing all variants ever encountered. From our perspective, the answer is no: a PanRG with variants at all non-lethal positions would be potentially intractable. The goal is not to have every possible allele in the PanRG, any more than a dictionary is required to contain absolutely every word that has ever been said in a language. As with dictionaries, there is a trade-off between completeness and utility, and in the case of bacteria, the language is far richer than English. The perfect PanRG contains the vast majority of the genes and intergenic regions you are likely to meet, and just enough breadth of allelic diversity to ensure reads map without too many mismatches. Missing alleles should be discoverable by local assembly and added to the graph, allowing multi-sample comparison of the cohort under study. This keeps the main PanRG lightweight enough for rapid and easy use.

We finish with three potential applications of pandora.
First, the PanRG should provide a more interpretable substrate for pan-genome-wide association studies, as current methods are forced to either ignore the accessory genome or reduce it to k-mers or unitigs (48-50). Second, if performing prospective surveillance of microbial isolates taken in a hospital, the PanRG provides a consistent and unchanging reference, which will cope with the diversity of strains seen without requiring the user to keep switching reference genome. In a sense it behaves similarly to whole-genome Multi-Locus Sequence Typing (wgMLST) (51), with more flexibility, support for intergenic regions, and without the all-or-nothing behaviour when alleles have a novel SNP. Third, if studying a fixed dataset very carefully, one may not want to use a population PanRG, as it will necessarily miss some rare accessory genes in the dataset. In these circumstances, one could construct a reference graph purely from the genes/intergenic regions present in that dataset.

There are a number of limitations to this study. Firstly, pandora is not yet a fully-fledged production tool. Two steps constitute bottlenecks in terms of RAM and speed. The RCC algorithm used for local graph construction is currently implemented in Python; however, the underlying algorithm is amenable to a much higher performance implementation, which is now in progress. Also, we use Clustal Omega (35) for the MSA stage, and there are faster options we could use, including options for augmenting an MSA without a complete rebuild (e.g. MAFFT), which is exactly what we need after local assembly discovers novel alleles. Secondly, we do not see any fundamental reason why the pandora error rate should be worse than snippy's on Illumina data (see Figure 6c), and we will be working to improve this.
Finally, by working in terms of atomic loci instead of a monolithic genome-wide graph, pandora opens up graph-based approaches to structurally diverse species (and eases parallelisation), but at the cost of losing genome-wide ordering. At present, ordering can be resolved by (manually) mapping pandora-discovered genes onto whole genome assemblies. However, the design of pandora also allows for gene-ordering inference: when Nanopore reads cover multiple genes, the linkage between them is stored in a secondary de Bruijn graph whose alphabet consists of gene identifiers. This results in a huge alphabet, but the k-mers are almost always unique, dramatically simplifying "assembly" compared with normal DNA de Bruijn graphs. This work is still in progress and the subject of a future study. In the meantime, pandora provides new ways to access previously hidden variation.

Conclusions

The algorithms implemented in pandora provide, to our knowledge, the first solution to the problem of analysing core and accessory genetic variation across a set of bacterial genomes. This study demonstrates SNP genotype error rates as good with Nanopore as with Illumina data, and improved recall of accessory variants. It also shows the benefit of an inferred VCF reference genome over simply picking from RefSeq. The main limitations were the use of a biased reference panel (RefSeq) for building the PanRG, and the comparatively slow performance of one module, currently implemented in Python; both are addressable, not fundamental, limitations. This opens the door to improved analyses of many existing and future bacterial genomic datasets.
Methods

Local graph construction

We construct each local graph in the PanRG from an MSA using an iterative partitioning process. The resulting sequence graph contains nested bubbles representing alternative alleles. Let A be an MSA of length n, and for a row a of A let s(a) be the sequence obtained by removing all non-AGCT symbols. We can partition the alignment A either vertically, by partitioning the interval [0, n), or horizontally, by partitioning the set of rows of A. In both cases, the partition induces a number of sub-alignments.

For vertical partitions, we say that an interval [i, j) of A is a match interval if it is at least m bases long and its alignment slice contains a single unique sequence; otherwise it is a non-match interval. For horizontal partitions, each row a is represented by a vector x_a of counts of the unique m-mers in s(a), and rows are grouped by K-means clustering on these vectors. For clusters C_1, ..., C_K, the inertia is defined as the sum over all vectors x_a of min_j ||x_a − μ_j||^2, where μ_j is the mean of cluster C_j.
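The vertical partition into match and non-match intervals can be sketched as a scan over alignment columns. The minimum match length m = 7 below mirrors make_prg's default but is an assumption here; the function returns half-open (start, end) intervals.

```python
def match_intervals(msa, m=7):
    """Partition MSA columns into match and non-match intervals.

    msa: list of equal-length aligned sequences. A column is a match
    column if all rows agree; runs of >= m match columns form match
    intervals, and everything else (including short match runs) is
    merged into non-match intervals.
    """
    n = len(msa[0])
    is_match = [len({row[i] for row in msa}) == 1 for i in range(n)]
    match, non_match = [], []
    i = 0
    while i < n:
        j = i
        while j < n and is_match[j] == is_match[i]:
            j += 1
        if is_match[i] and j - i >= m:
            match.append((i, j))
        else:
            non_match.append((i, j))
        i = j
    # merge adjacent non-match runs (short match runs count as non-match)
    merged = []
    for start, end in sorted(non_match):
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return match, merged
```

Match intervals collapse to a single node in the local graph, while each non-match interval is handed to the horizontal clustering step.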
The recursive algorithm first partitions an MSA vertically into match and non-match intervals. Match intervals are collapsed down to the single sequence they represent. Independently for each non-match interval, the alignment slice is partitioned horizontally into clusters. The same process is then applied to each induced sub-alignment until a maximum number of recursion levels, r = 5, has been reached. For any remaining alignments, a node is added to the local graph for each unique sequence. See Supplementary Animation 1 for an example of this algorithm. We name this algorithm Recursive Cluster and Collapse (RCC), and implement it in the make_prg repository (see Code Availability).

(w,k)-minimizers of graphs

We define (w,k)-minimizers of strings as in Li (2016). Briefly, let φ be a k-mer hash function and let s be a string. For a window starting at position j of s, let T_j = {(φ(s[p, p+k)), p, r) : j ≤ p < j + w, r ∈ {0, 1}} be the set of hashed forward (r = 0) and reverse-complement (r = 1) k-mers of s in this window. We define a (w,k)-minimizer to be any triple (h, p, r) in T_j such that h is minimal in T_j. The set of (w,k)-minimizers for s is the union of minimizers over all such windows.
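The string case can be sketched directly from this definition. Lexicographic order stands in for the hash function φ here, which is a simplification; real implementations use a randomised hash to avoid poly-A minimizers.

```python
def minimizers(s, w, k):
    """(w,k)-minimizers of a DNA string, as triples (kmer, pos, strand).

    Lexicographic order stands in for the hash function; strand 0 is
    forward and strand 1 is reverse complement, as in the definition.
    """
    comp = str.maketrans("ACGT", "TGCA")
    kmers = []
    for p in range(len(s) - k + 1):
        fwd = s[p:p + k]
        rev = fwd.translate(comp)[::-1]
        kmers.append([(fwd, p, 0), (rev, p, 1)])  # both orientations
    found = set()
    for j in range(len(kmers) - w + 1):
        window = [t for pair in kmers[j:j + w] for t in pair]
        h = min(t[0] for t in window)
        found.update(t for t in window if t[0] == h)
    return sorted(found, key=lambda t: (t[1], t[2]))
```

The graph generalisation below applies the same window minimum to every path of length w + k − 1.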
We extend this definition intuitively to an acyclic sequence graph G = (V, E). Define |v| to be the length of the sequence associated with node v, and consider paths that are allowed to overlap only part of the sequence associated with their first and last nodes; this matches the intuitive definition of a path in a sequence graph. We will use s_p to refer to the sequence along the path p in the graph.

Let p be a path of length w + k − 1 in G. The string s_p contains w consecutive k-mers, for which we can find the (w,k)-minimizer(s) as before. We therefore define the (w,k)-minimizer(s) of the graph G to be the union of minimizers over all paths of length w + k − 1 in G.

Local graph indexing with (w,k)-minimizers

To find minimizers for a graph we use a streaming algorithm, as described in Supplementary Algorithm 1: for each minimizer found, it simply finds the next minimizer(s) until the end of the graph has been reached. Let walk(v, i, w, k) be a function which returns all vectors of w consecutive k-mers in G starting at position i on node v. Suppose we have a vector of k-mers x. Let shift(x) be the function which returns all possible vectors of k-mers which extend x by one k-mer; it does this by considering all possible ways to walk one letter in G from the end of the final k-mer of x. For a vector of k-mers x of length w, the function minimize(x) returns the minimizing k-mers of x.

We define K to be a k-mer graph with nodes corresponding to minimizers (h, p, r). We add edge (u, v) to K if there exists a path in G on which u and v are adjacent minimizers. The resulting PanRG index stores a map from each minimizing k-mer hash value to the positions in all local graphs where that (w,k)-minimizer occurred. In addition, we store the induced k-mer graph for each local graph.

Quasi-mapping reads

We infer the presence of PanRG loci in reads by quasi-mapping.
For each read, a sketch of (w,k)-minimizers is made, and these are queried in the index. For every (w,k)-minimizer shared between the read and a local graph in the PanRG index, we define a hit to be the coordinates of the minimizer in the read and local graph and whether it was found in the same or reverse orientation. We define clusters of hits from the same read, local graph and orientation if consecutive read coordinates are within a certain distance. If this cluster is of sufficient size, the locus is deemed to be present and we keep the hits for further analysis. Otherwise, they are discarded as noise. The default for this "sufficient size" is at least 10 hits and at least 1/5th the length of the shortest path through the k-mer graph (Nanopore) or the number of k-mers in a read sketch (Illumina). Note that there is no requirement for all these hits to lie on a single path through the local graph. A further filtering step is therefore applied after the sequence at a locus is inferred, to remove false positive loci indicated by low mean or median coverage along the inferred sequence compared with the global average coverage. This quasi-mapping procedure is described in pseudocode in Supplementary Algorithm 2.
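The hit-clustering step can be sketched as follows: hits sharing a (local graph, orientation) key are sorted by read coordinate and split wherever consecutive coordinates are further apart than a gap threshold. The `max_gap` value and the fixed `min_cluster_size` are illustrative stand-ins for the distance and size criteria described above.

```python
from itertools import groupby

def cluster_hits(hits, max_gap=30, min_cluster_size=10):
    """hits: iterable of (graph_id, orientation, read_pos) tuples. Returns
    clusters (lists of hits) passing the size filter; smaller ones are noise."""
    clusters = []
    keyfn = lambda h: (h[0], h[1])
    ordered = sorted(hits, key=lambda h: (h[0], h[1], h[2]))
    for _, group in groupby(ordered, key=keyfn):
        current = []
        for hit in group:
            # start a new cluster when consecutive read coordinates are too far apart
            if current and hit[2] - current[-1][2] > max_gap:
                clusters.append(current)
                current = []
            current.append(hit)
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_cluster_size]
```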
Initial sequence approximation as a mosaic of references

For each locus identified as present in the set of reads, quasi-mapping provides (filtered) coverage information for nodes of the directed acyclic k-mer graph. We use these to approximate the sequence as a mosaic of references as follows. We model k-mer coverage with a negative binomial distribution and use the simplifying assumption that k-mers are read independently. Let Θ be the set of possible paths through the k-mer graph which could correspond to the true genomic sequence from which reads were generated. Let r + s be the number of times the underlying DNA was read by the machine, generating a k-mer coverage of s, and r instances where the k-mer was sequenced with errors. Let 1 − p be the probability that a given k-mer was sequenced correctly. For any path θ ∈ Θ, the likelihood of θ is computed from the coverages observed on its constituent k-mers under this model, and we take the maximum likelihood path. For choices of w ≤ k there is a unique sequence along the discovered path through the k-mer graph (except in rare cases within the first or last w − 1 bases). We use this closest mosaic of reference sequences as an initial approximation of the sample sequence.

De novo variant discovery

The first step in our implementation of local de novo variant discovery in genome graphs is finding the candidate regions of the graph that show evidence of dissimilarity from the sample's reads.

Finding candidate regions

The input required for finding candidate regions is a local graph n within the PanRG, the maximum likelihood path of both sequence and k-mers in n (mlp_n and mlp_kn respectively), and a padding size w for the number of positions surrounding the candidate region to retrieve. We define a candidate region, r, as an interval within n where coverage on mlp_n is less than a given threshold, c, for more than l and fewer than m consecutive positions. m acts to restrict the size of variants we are able to detect.
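The candidate-region scan can be sketched as a single pass over per-position coverage on the maximum likelihood path, reporting low-coverage runs whose length lies strictly between the two bounds, padded by w positions on each side. Parameter names here are ours, not pandora's.

```python
def candidate_regions(coverage, threshold, min_len, max_len, pad):
    """Return padded (start, end) intervals where coverage stays below
    `threshold` for more than min_len and fewer than max_len positions."""
    regions, run_start = [], None
    for i, c in enumerate(list(coverage) + [threshold]):   # sentinel closes a final run
        if c < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            run_len = i - run_start
            if min_len < run_len < max_len:
                regions.append((max(0, run_start - pad),
                                min(len(coverage), i + pad)))
            run_start = None
    return regions
```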
If m is set too large, the following steps become much slower due to the combinatorial expansion of possible paths. For a given read, s, that has a mapping to r, we define s_r to be the subsequence of s that maps to r, including an extra w positions either side of the mapping. We define the pileup P_r as the set of all s_r.

Enumerating paths through candidate regions

For each r ∈ R, where R is the set of all candidate regions, we construct a de Bruijn graph G_r from the pileup P_r using the GATB library (54). A_L and A_R are defined as sets of k-mers to the left and right of r in the local graph. They are anchors to allow re-insertion of new sequences found by de novo discovery into the local graph. If we cannot find an anchor on both sides, then we abandon de novo discovery for r. We use sets of k-mers for A_L and A_R, rather than a single anchor k-mer, to provide redundancy in the case where sequencing errors cause the absence of some k-mers in G_r. Once G_r is built, we define the start anchor k-mer, a_L, as the first k-mer in A_L found in G_r. Likewise, we define the end anchor k-mer, a_R, as the first k-mer in A_R found in G_r.

T_r is the spanning tree obtained by performing depth-first search (DFS) on G_r, beginning from node a_L. We define p_r as a path starting from the root node a_L of T_r and ending at node a_R which fulfils two conditions: 1) p_r is shorter than the maximum allowed path length; 2) no more than k nodes along p_r have coverage < f × e_r, where e_r is the expected k-mer coverage for r and f is n_r × s, where n_r is the number of iterations of path enumeration for r and s is a step size (0.1 by default).

V_r is the set of all p_r. If |V_r| is greater than a predefined threshold, then we have too many candidate paths, and we decide to filter more aggressively: f is incremented by s (effectively requiring more coverage for each p_r) and V_r is repopulated. If f > 1.0 then de novo discovery is abandoned for r.
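The adaptive filtering loop above can be sketched as follows. Here `enumerate_paths` is a hypothetical stand-in for the anchored DFS walk (returning each candidate path as its list of node coverages), and `max_low` plays the role of the bound k on low-coverage nodes.

```python
def filtered_paths(enumerate_paths, expected_cov, max_paths, step=0.1, max_low=5):
    """Raise the coverage factor f by `step` until few enough paths pass,
    or give up once f exceeds 1.0 (abandon de novo discovery)."""
    f = step                                     # n_r = 1 on the first pass
    while f <= 1.0:
        paths = [p for p in enumerate_paths()
                 if sum(cov < f * expected_cov for cov in p) <= max_low]
        if len(paths) <= max_paths:
            return paths
        f += step                                # filter more aggressively
    return []                                    # too many paths at every f
```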
Pruning the path-space in a candidate region

As we operate on both accurate and error-prone sequencing reads, the number of valid paths in G_r can be very large. Primarily, this is due to cycles that can occur in G_r and to exploring paths that will never reach our required end anchor a_R. In order to reduce the path-space within G_r, we prune paths based on multiple criteria. Critically, this pruning happens at each step of the graph walk (path-building).

We use a distance-based optimisation based on Rizzi et al (55). In addition to T_r, obtained by performing DFS on G_r, we produce a distance map D_r that results from running reversed breadth-first search (BFS) on G_r, beginning from node a_R. We say reversed BFS as we explore the predecessors of each node, rather than the successors. D_r is implemented as a binary search tree where each node in the tree represents a k-mer in G_r that is reachable from a_R via reversed BFS. Each node additionally has an integer attached to it that describes the distance from that node to a_R.

We can use D_r to prune the path-space by 1) requiring, for each node n ∈ p_r, that n ∈ D_r, and 2) requiring that a_R be reached from n in, at most, i nodes, where i is defined as the maximum allowed path length minus the number of nodes walked to reach n. If one of these conditions is not met, we abandon p_r. The advantage of this pruning process is that we never explore paths that will not reach our required endpoint within the maximum allowed path length, and, when caught in a cycle, we abandon the path once we have made too many iterations around the cycle.

Graph-based genotyping and optimal reference construction for multi-genome comparison

We use graph-based genotyping to output a comparison of samples in a VCF. A path through the graph is selected to be the reference sequence, and graph variation is described with respect to this reference.
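The distance-map pruning can be sketched as a BFS over predecessors from the end anchor, plus a constant-time check applied to the tip of each partial path. The graph is represented here as a plain dict from node to successors; pandora's D_r is a binary search tree, but a hash map gives the same semantics.

```python
from collections import deque

def distance_map(graph, end_anchor):
    """Reversed BFS from end_anchor: node -> number of edges to end_anchor.
    Nodes that cannot reach end_anchor are simply absent from the map."""
    preds = {}
    for u, vs in graph.items():
        for v in vs:
            preds.setdefault(v, []).append(u)
    dist = {end_anchor: 0}
    q = deque([end_anchor])
    while q:
        v = q.popleft()
        for u in preds.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def keep_path(path_tip, nodes_walked, dist, max_path_len):
    """Prune unless the tip can still reach the end anchor within budget."""
    return path_tip in dist and dist[path_tip] <= max_path_len - nodes_walked
```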
The chromosome field then details the local graph, and the position field gives the position within the chosen reference sequence for possible variant alleles. The reference path for each local graph is chosen to be maximally close to the set of sample mosaic paths. This is achieved by reusing the mosaic path-finding algorithm detailed in Supplementary Algorithm 3 on a copy of the k-mer graph with coverages incremented along each sample mosaic path, and a modified probability function defined such that the probability of a node is proportional to the number of samples covering it. This results in an optimal path, which is used as the VCF reference for the multi-sample VCF file.

For each sample and site in the VCF file, the mean forward and reverse coverage on k-mers tiling alleles is calculated. A likelihood is then independently calculated for each allele based on a Poisson model. An allele A in a site is called if: 1) A is on the sample mosaic path (i.e. it is on the maximum likelihood path for that sample); and 2) A is the most likely allele to be called based on the Poisson model. Every allele not on the sample mosaic path will not satisfy 1) and will thus not be called. In the uncommon event where an allele satisfies 1) but not 2), we have an incompatibility between the global and the local choices, and the site is genotyped as null.

Comparison of variant-callers on a diverse set of E. coli

Sample selection

We used a set of 20 diverse E. coli samples for which matched Nanopore and Illumina data and a high-quality assembly were available. These are distributed across 4 major phylogroups of E. coli as shown in Figure 4. Of these, 16 were isolated from clinical infections and rectal screening swabs in ICU patients in an Australian hospital (56). One is the reference strain CFT073, which was resequenced and assembled by the REHAB consortium (57).
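A much-simplified sketch of the per-site genotyping logic: each allele's mean k-mer coverage is scored under a Poisson model, and an allele is called only when it is both the most likely allele and the one on the sample's mosaic path; otherwise the site is null. This collapses the forward/reverse coverages into a single mean and uses a bare Poisson rate, so it is illustrative only.

```python
import math

def poisson_log_lik(mean_cov, expected_depth):
    # log P(coverage = mean_cov | Poisson with rate expected_depth)
    return (mean_cov * math.log(expected_depth) - expected_depth
            - math.lgamma(mean_cov + 1))

def genotype(alleles, expected_depth, mosaic_allele):
    """alleles: dict name -> mean k-mer coverage. Return the called allele,
    or None (null genotype) when local and global choices disagree."""
    logliks = {a: poisson_log_lik(c, expected_depth) for a, c in alleles.items()}
    best = max(logliks, key=logliks.get)
    return best if best == mosaic_allele else None
```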
One is from an ST216 cardiac ward outbreak (identifier: H131800734); the Illumina data was previously obtained (58) and we performed the Nanopore sequencing (see below). The two final samples were obtained from Public Health England: one is a Shiga-toxin [...] for their publication. The PanRG was built using make_prg. Two loci (GC00000027_2 and GC00004221) out of 37,428 were excluded because the combination of Clustal Omega and make_prg did not complete in reasonable time (~24 hours) once de novo variants were added.

Nanopore sequencing of sample H131800734

DNA was extracted using a Blood & Cell Culture DNA Midi Kit (Qiagen, Germany) and prepared for Nanopore sequencing using kits EXP-NBD103 and SQK-LSK108. Sequencing was performed on a MinION Mk1 Shield device using a FLO-MIN106 R9.4 SpotON flowcell and MinKNOW version 1.7.3, for 48 hours.
[...] composed of the allele and 50 bases of flank on either side taken from G1) and then map both to G2 using minimap2. We then evaluate these mappings to verify whether the variant found is indeed correct (TP) or not (FP) as follows. If the mapping quality is zero, the variant is discarded to avoid paralogs/duplicates/repeats that are inherently hard to assess. We then check for mismatches in the allele after mapping and confirm that the called allele is the better match.

Constructing a set of ground truth pan-genome variants

When seeking to construct a truth set of all variants within a set of bacterial genomes, there is no universal coordinate system. We start by taking all pairs of genomes and finding the variants between them, and then need to deduplicate them; e.g. when a variant between genomes 1 and 2 is the same as a variant between genomes 3 and 4, they should be identified. We define "the same" in terms of genome, coordinate and allele. An allele A in a position P of a chromosome C of a genome G is defined as the triple (G, C, P). Two pairwise variants pwV_1 and pwV_2 are in the same pan-genome variant gV_P if they share an allele.

We model the above problem as a graph problem. We represent each pairwise variant as a node in an undirected graph G. There is an edge between two nodes n_1 and n_2 if n_1 and n_2 share an allele. Each component (maximal connected subgraph) of G then defines a pan-genome variant, built from the set of pairwise variants in the component, satisfying all the properties previously described. Therefore, the set of components of G defines the set of pan-genome variants P. However, a pan-genome variant in P could: i) have more than one allele stemming from a single genome, due to a duplication/repeat; ii) represent biallelic, triallelic or tetrallelic SNPs/indels.
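The component construction can be sketched with union-find instead of an explicit graph: each pairwise variant is a node, and any two variants sharing an allele triple are merged into one component (one pan-genome variant). The data layout here (variants as frozensets of allele triples) is our illustration, not the paper's implementation.

```python
def pan_genome_variants(pairwise):
    """pairwise: list of frozensets of (genome, chromosome, position) allele
    triples. Returns components as sets of indices into `pairwise`."""
    parent = list(range(len(pairwise)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    by_allele = {}
    for i, var in enumerate(pairwise):
        for allele in var:
            j = by_allele.setdefault(allele, i)
            parent[find(i)] = find(j)        # union variants sharing an allele

    comps = {}
    for i in range(len(pairwise)):
        comps.setdefault(find(i), set()).add(i)
    return list(comps.values())
```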
For this evaluation, we chose to have a smaller but more reliable set of pan-genome variants, and thus we filtered P by restricting it to the set P′ of pan-genome variants gV_P ∈ P such that: i) gV_P has at most one allele stemming from each genome; ii) gV_P is a biallelic SNP. P′ is the set of 618,305 ground truth filtered pan-genome variants that we extracted by comparing and deduplicating the pairwise variants present in our 20 samples, and that we use to evaluate the recall of all the tools in this paper. Supplementary Figure 11 shows an example summarising the described process of building pan-genome variants from a set of pairwise variants.

Subsampling read data and running all tools

All read data was randomly subsampled to 100x coverage using rasusa; the pipeline is available at https://github.com/iqbal-lab-org/subsampler. A snakemake (71) pipeline to run the pandora workflow with and without de novo discovery (see Figure 2d) is available at https://github.com/iqbal-lab-org/pandora_workflow. A snakemake pipeline to run snippy, SAMtools, nanopolish and medaka on all pairwise combinations of 20 samples and 24 references is available at https://github.com/iqbal-lab-org/variant_callers_pipeline.

Evaluating VCF files

Calculating precision

Given a variant/VCF call made by any of the evaluated tools, where the input were reads from a sample (or several samples, in the case of pandora) and a reference sequence (or a PanRG, in the case of pandora), we perform the following steps to assess how correct a call is: [...] 4. We further remove calls mapping to masked regions of the sample sequence, in order not to evaluate calls lying in potentially misassembled regions; 5. We then evaluate the mapping, giving the call a continuous precision score between 0 and 1. If the mapping does not cover the whole called allele, we give a score of 0.
Otherwise, we look only at the alignment of the called allele (i.e. we ignore the alignment of the flanking sequences) and give a score of: number of matches / alignment length. Finally, we compute the precision for the tool by summing the scores of all evaluated calls and dividing by the number of evaluated calls. Note that here we evaluate all types of variants, including SNPs and indels.

Calculating recall

We perform the following steps to calculate the recall of a tool: 1. Apply the VCF calls to the associated reference using the VCF consensus builder (https://github.com/leoisl/vcf_consensus_builder), creating a mutated reference with the variants identified by the tool; 2. Build probes for each allele of each pan-genome variant previously computed (see Section "Constructing a set of ground truth pan-genome variants"); 3. Map all pan-genome variants' probes to the mutated reference using BWA-MEM; 4. Evaluate each probe mapping, which is classified as a TP only if all bases of the allele were correctly mapped to the mutated reference. In the uncommon case where a probe multimaps, it is enough that one of the mappings is classified as TP; 5. Finally, as we now know for each pan-genome variant which of its alleles were found, we calculate both the pan-variant recall and the average allelic recall as per Section "Pandora detects rare variation inaccessible to single-reference methods".
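The continuous precision score and its aggregation can be sketched directly; the alignment itself (here reduced to a pre-extracted aligned segment and a coverage flag) would in practice come from the mapper.

```python
def precision_score(called_allele, aligned_sample_seq, covered):
    """Score a single call: 0 if the mapping does not span the whole allele,
    otherwise matches / alignment length over the allele only."""
    if not covered:
        return 0.0
    matches = sum(a == b for a, b in zip(called_allele, aligned_sample_seq))
    aln_len = max(len(called_allele), len(aligned_sample_seq))
    return matches / aln_len

def precision(scores):
    # sum of per-call scores divided by the number of evaluated calls
    return sum(scores) / len(scores)
```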

Filters
Given a VCF file with likelihoods for each genotype, the genotype confidence is defined as the log likelihood of the maximum likelihood genotype minus the log likelihood of the next best genotype. Thus a confidence of zero means all alleles are equally likely, and high quality calls have higher confidences. In the recall/error rate plots of Figure 5 and Figures 6a,b, each point corresponds to the error rate and recall computed as previously described, on a genotype confidence (gt-conf) filtered VCF file with a specific threshold for minimum confidence.

We also show the same plot with further filters applied in Supplementary Figure 1. The filters were as follows. For Illumina data: for pandora, a minimum coverage filter of 5x, a strand bias filter of 0.05 (minimum 5% of reads on each strand), and a gaps filter of 0.8 were applied. The gaps filter means at least 20% of the minimizer k-mers on the called allele must have coverage above 10% of the expected depth. As snippy has its own internal filtering, no filters were applied. For SAMtools, a minimum coverage filter of 5x was used. For Nanopore data: for pandora, a minimum coverage filter of 10x, a strand bias filter of 0.05, and a gaps filter of 0.6 were used. For nanopolish, we applied a coverage filter of 10x. We were unable to apply a minimum coverage filter to medaka due to a software bug that prevents annotating the VCF file with coverage information.

Locus presence and distance evaluation

For all loci detected as present in at least one sample by pandora, we mapped the multi-sample inferred reference to all 20 sample assemblies and 24 references, to identify their true locations. To be confident of these locations, we employed a strict mapping using bowtie2 (73), requiring end-to-end alignments.
From the mapping of all loci to all samples, we computed a truth locus presence-absence matrix and compared it with pandora's locus presence-absence matrix, classifying each pandora locus call as a true/false positive/negative. Supplementary Figure 3 shows these classifications split by locus length. Having the location of all loci in all 20 sample assemblies and the 24 references, we then computed the edit distance between them.
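The matrix comparison amounts to classifying each (sample, locus) cell; a minimal sketch, assuming both matrices are given as boolean maps keyed by (sample, locus):

```python
def classify_calls(truth, called):
    """truth, called: dict (sample, locus) -> bool presence. Returns counts
    of true/false positives/negatives over the truth matrix's cells."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for key, t in truth.items():
        c = called.get(key, False)
        if t and c:
            counts["TP"] += 1
        elif not t and c:
            counts["FP"] += 1
        elif t and not c:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts
```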
• Input packages containing all data to reproduce both the 4- and 20-way analyses