Pangenome Graph Construction from Genome Alignment with Minigraph-Cactus

Reference genomes provide mapping targets and coordinate systems but introduce biases when samples under study diverge sufficiently from them. Pangenome references seek to address this by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but thanks to advances in long-read sequencing, high-quality phased assemblies are becoming widely available. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph’s ability to consistently represent variation at different scales and reduces biases introduced by reference-based variant calls. Pangenome construction in this way is equivalent to multiple genome alignment. Here we present the Minigraph-Cactus pangenome pipeline, a method to create pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium (HPRC). This tool was designed to build graphs containing all forms of genetic variation while still being practical for use with current mapping and genotyping tools. We show that this graph is useful both for studying variation within the input haplotypes, but also as a basis for achieving state of the art performance in short and long read mapping, small variant calling and structural variant genotyping. We further measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes, and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods, even after projecting back to GRCh38. We also demonstrate that our method can apply to nonhuman data by showing improved mapping and variant detection sensitivity with a Drosophila melanogaster pangenome.


Introduction
The term pangenome has historically referred to the set of genes present across a population or species. The patterns of presence and absence of genes from the pangenome in individual samples, typically prokaryotes, provided a rich context for better understanding the genes and populations in question 1 . Eukaryotic genomes can likewise be combined into pangenomes, which can be expressed in terms of genomic content rather than genes. Eukaryotic pangenomics is growing in popularity, due in part to its potential to reduce reference bias 2 .
A pangenome can be represented as a set of variants against a reference 3 , but technological advances in long-read sequencing are now making it possible to produce high-quality de novo genome assemblies of samples under study, allowing for variation to be studied within its full genomic context 4 . Two themes that have emerged from this work are that 1) relying on a single reference genome can be a source of bias, especially for short-read sequencing projects, and 2) representation of structural variation is a challenging problem in its own right. Pangenomes and the software toolkits that work with them aim to address these issues.
Sequence-resolved pangenomes are typically represented using graph models. There are two main classes of graph representation: sequence graphs and de-Bruijn graphs, and several different methods have been published for each type. This is an area of active research; different methods perform better for different applications, and there is as yet no clear best practice. However, sequence graphs have generally proved more amenable for read mapping 3,5,6 , and they will be the focus of this work. In a sequence graph, each node corresponds to a DNA sequence ( Figure 1A) or its reverse complement depending on the direction in which it is traversed. Sample haplotypes are stored as paths, and edges are bidirected to encode strandedness (i.e. if an edge is incident to the forward or reverse complement sequence of a node). Sites of variation appear as bubbles, or snarls, which are defined by characteristic subgraphs 7 . Two snarls are indicated in the example graph in Figure 1A, the left and right representing a two-base substitution and 19-base deletion, respectively.
Phased Variant Call Format (VCF) files can be thought of as sequence graphs. The vg toolkit makes this perspective explicit by supporting graph construction from VCF 3 . Using such graphs for mapping and variant calling reduces reference bias and improves accuracy over GRCh38 3,6 . These graphs can also be used to accurately genotype structural variants (SVs) 5 , but they are still limited to reference-based variant calls. For example, there is no satisfactory way in VCF 4.3 to directly represent variation nested within a large insertion. Now that they are becoming widely available 8 , high-quality assemblies can instead be used to directly construct a pangenome graph without the need to go through variant calls. This is equivalent to finding a whole genome multiple alignment, which is known to be an extremely computationally challenging problem 9 . view of a sequence graph shows two haplotypes as paths through the graph. The two snarls (variation sites defined by graph topology, aka bubbles) are highlighted. B) The five steps, and associated tools, of the Minigraph-Cactus pipeline which takes as input genome assemblies in FASTA format and outputs a pangenome graph, genome alignment, VCF and indexes required for mapping with vg Giraffe. Illustrating the steps in the pipeline by example: C) SV graph construction using minigraph (as wrapped by cactus-minigraph) begins with a linear reference and adds SVs, in this case a single 1204bp inversion (at ch2L:17,144,069 in the D. melanogaster pangenome). D) The input haplotypes are mapped back to the graph with minigraph, in this example six of which contain the inversion allele from C. E) The minigraph mappings are combined into a base-resolution graph using Cactus, augmenting the larger SVs with smaller variants -in this case, adding smaller variants within the inversion. F) An unaligned centromere is clipped out of a graph, leaving only the reference (blue) allele in that region. The other alleles are each broken into two separate subpaths but are otherwise unaffected outside the clipped region.

Minigraph SV graph construction
The pipeline begins with the construction of an initial SV-only graph using minigraph as described in 16 . By default, only variants affecting 50bp of sequence or more are included. This is an iterative procedure that closely resembles partial order alignment (POA): a "reference" assembly is chosen as an initial backbone, and then augmented with variation from the remaining assemblies in turn. Figure 1C shows an example of an inversion being augmented into a reference chromosome. Minigraph does not collapse duplications: If two copies of a gene are present in the graph after adding i genomes, but there are three copies in the i+1th genome, then an additional copy will be added to the graph. This is a key difference between minigraph and other approaches (including Progressive Cactus) that would tend to collapse all copies of the gene into a single sequence in the absence of outgroup information to determine the ancestral state. By keeping different gene copies separate, minigraph trades greater graph size for reduced path complexity (fewer cycles).

Minigraph contig mapping
Minigraph generalizes the minimizer-based seeding and chaining concepts from minimap2 18 for use on sequence graphs. For this current work we generalized it to produce base-level alignments between contigs and graphs (but not base-level graphs). In this step of the pipeline each assembly, including the reference, is mapped back to the SV graph independently ( Figure  1D). The results are concatenated into a single Graphical Alignment Format (GAF) file, which is then filtered to remove spurious alignments (See Methods Section for details). By re-aligning each assembly to the same graph in this step as opposed to re-using the iterative mappings created during construction, we mitigate an issue in the latter where orthologous sequences can be aligned to inconsistent locations when mapped to different versions of the graph.

Splitting by chromosome
Minigraph does not introduce interchromosomal events during graph construction, so every node in the SV graph is connected to exactly one chromosome (or contig) from the reference assembly. This information is used to split the mappings obtained in the previous step into chromosomes. If a contig maps to nodes from multiple chromosomes, it is assigned to the chromosome to which the most of its bases align. Thresholds (detailed in the Methods Section) are used to filter out contigs that cannot be confidently assigned to any reference chromosome. Such contigs will be excluded from the constructed graph. Graph construction proceeds on each reference chromosome independently, which serves to increase parallelism and reduce peak memory usage (per job). These computational advantages are required to construct a 90-sample human pangenome graph on current hardware, but smaller datasets could be run all at once if desired, avoiding this step entirely.

Cactus base alignment
At its core, Cactus is a procedure for combining a set of pairwise alignments into a multiple alignment 13,20 : It begins by "pinching" exactly matching aligned bases together in the pairwise alignments to form an initial sequence graph ( Figure 1A). This sequence graph is then transformed into a Cactus graph (Supplementary Figure 1A-C), whose cycles represent the "chains" of alignment within the sequence graph 15 . The topology of the Cactus graph is first used to remove candidate spurious or incomplete alignments corresponding to short, high-degree alignment chains. Interstitial unaligned sequences that share common anchors at their ends are then aligned together. This process as a whole remains unchanged at a conceptual level when using Cactus to construct pangenome alignments, but substantial changes to each step were required by the increase in the number of input genomes: Cactus does not typically align more than four genomes (two ingroups and two outgroups) at a time when computing progressive alignments, so scaling to 90 HPRC samples (and beyond) required the underlying graph structures to be rewritten to use less memory, as well as completely replacing the algorithm for interstitial sequence alignment. Briefly, the previous all-pairs approach, which scales quadratically with the number of genomes, was replaced with a Partial Order Alignment (POA) approach that scales linearly (See Methods for details).
Cactus natively outputs genome alignments in Hierarchical Alignment (HAL) format 21 . HAL files can be used to create assembly hubs on the UCSC genome browser, or to map annotations between genomes 22 , but they are not suitable for most pangenome graph applications, which expect GFA or VG. We therefore created a new tool, hal2vg, to convert HAL alignments into VG format. (see Methods for more details). These graphs contain the underlying structural variation from the SV graph constructed by minigraph along with smaller variants, and the input haplotypes are represented as paths ( Figure 1E).

Indexing and clipping
The final step of the pipeline combines the chromosome level results and performs some post-processing. This includes reassigning node ids so that they are globally unique across different chromosome graphs, and collapsing redundant sequence where possible using gaffix 23 (Supplementary Figure 1D). Nodes are also replaced with their reverse complement as necessary to ensure that reference paths only ever visit them in the forward orientation. The original SV graph produced by minigraph remains embedded in the results at this stage, with each minigraph node being represented by a separate embedded path.
Minigraph-Cactus (in common with all MSA tools we know of 24 ) cannot presently satisfactorily align highly repetitive sequences like satellite arrays, centromeres and telomeres because they lack sufficiently unique subsequences for minigraph to use as alignment seeds. As such, these regions will remain largely unaligned throughout the pipeline and will make the graph difficult to index and map to by introducing vast amounts of redundant sequence. We recommend clipping them out for most applications and provide the option to do so by removing paths with >N bases that do not align to the underlying SV graph constructed with minigraph ( Figure 1F). In preliminary studies of mapping short reads and calling small variants (see below), we found that even more aggressively filtering the graph helps improve accuracy. For this reason, an optional allele-frequency filter is included to remove nodes of the graph present in fewer than N haplotypes and can be used when making indexes for vg giraffe.
In all, up to three graphs are produced while indexing: 1) Full graph: useful for storing complete sequences and performing liftover (translation between corresponding haplotypes); difficult to index and map to because of unaligned centromeres. These graphs are typically created only as intermediate results, and are not directly used in any of the results in this report. 2) Default graph: clip out all stretches of sequences >=10kb that do not align to the minigraph. The intuition is that large SVs not in minigraph are under-alignments of sequence not presently alignable and not true variants. The 10kb threshold is arbitrary but empirically was found to work well. This graph is ideal for studying variation and exporting to VCF, and can be effectively indexed for read mapping. These graphs are used in all results unless otherwise is explicitly stated. 3) Allele-frequency filtered graph: remove all nodes present in fewer than N haplotypes.
This filter increases accuracy for short read mapping and variant calling, as shown in Supplementary Figures 6 and 7, respectively. These graphs are used for mapping with vg giraffe.
Graph 2) is a subgraph of graph 1), and graph 3) is a subgraph of graph 2). They are node-id compatible, in that any node shared between two of the graphs will have the same sequence and ID. Unless otherwise stated, all results below about the graphs themselves are referring to the default graphs, whereas all results pertaining to short read mapping and small variant calling were performed on the allele-frequency filtered graphs.

Human Pangenome Reference Graphs
The Minigraph-Cactus pipeline was originally developed to construct a pangenome graph for the assemblies produced by the Human Pangenome Reference Consortium (HPRC). In its first year, this consortium released 47 diploid assemblies 25 . For evaluation purposes, we held out three samples when generating the graph: HG002, HG005 and NA19240. The remaining 44 samples (88 haplotypes), and two reference genomes (GRCh38 and CHM13 [v1.1] 26 were used to construct the graph, with 90 haploid genomes total. Since the construction procedure is dependent on the reference chosen for the graph, we ran our pipeline twice independently on the same input assemblies, once using GRCh38 as the reference and once CHM13. The CHM13-based graph includes more difficult and highly variant regions, such as in the acrocentric short arm of chr21, that are not represented in the GRCh38-based graph. This makes it slightly bigger than the GRCh38-based graph, both in terms of total sequence and in terms of nodes and edges (Supplementary Table 1). The final pangenomes have roughly 200X more nodes and edges than the SV Graphs from Minigraph, showing the amount of small variation required in order to embed the haplotype paths. Figure 2A shows the amount of non-reference sequence as a function of how many haploid genomes contain it (the same plot for total sequence can be found in Supplementary Figure 2). The rise in the leftmost points (support=1) is due to private sequence, only present in one sample, and may also contain alignment artifacts which often manifest as under-alignments affecting a single sample. The plot clearly shows that the CHM13-based graph has less non-reference sequence present across the majority of samples, an apparent consequence of the improved completeness of CHM13 over GRCh38. The distribution of allele sizes within snarls (variant sites in the pangenome defined by graph topology; Figure 2B) highlights the amount of small variation added relative to Minigraph alone. The total time to create and index each HPRC pangenome graph was roughly 3 days (Supplementary Table 4).

Figure 2: Evaluating GRCh38 and T2T-CHM13 based human pangenomes A)
The amount of non-reference sequence in the HPRC graphs by the minimum number of haplotypes it is contained in. B) Distribution of the size of the snarls (variation sites, aka bubbles) for the GRCh38-based Minigraph, GRCh38-based and CHM13-based Minigraph-Cactus pangenomes. Note that in the case of overlapping variants, snarls can be much larger than any single event that they contain. C)~30x Illumina short-reads for three GIAB samples were mapped using three approaches: BWA-MEM on GRCh38 (blue), vg Giraffe on the linear pangenomes with GRCh 38 or CHM13 (grey), vg Giraffe on the GRCh38-referenced or CHM13-referenced HPRC pangenome (red). C) Proportion of the reads aligning perfectly to the (pan-)genome for each sample (y-axis

Mapping to the HPRC Graphs
We benchmarked how well the pangenome graphs could be used as drop-in replacements for linear references in a state-of-the-art small variant (<50bp) discovery and genotyping pipeline.
To do so, we used Illumina short reads (~30x coverage) from three Genome in a Bottle (GIAB) samples, HG001, HG002, and HG005. All mapping experiments were performed on filtered HPRC graphs with a minimum allele frequency of 10%, meaning that nodes supported by fewer than 9 haplotypes were removed. This threshold was chosen to maximize variant calling sensitivity and mapping speed for the Giraffe-DeepVariant pipeline ( Supplementary Figures 8  and 9, respectively). We found that reads aligned with higher identity when mapped to the pangenomes using Giraffe, compared to the traditional approach of mapping reads with BWA-MEM on GRCh38. We also mapped reads to the linear references with Giraffe and achieved similar results to using BWA. On average, 78.1% and 78.9% of reads aligned perfectly for the GRCh38-based and CHM13-based pangenomes, respectively, compared to 68.7% when using BWA-MEM on GRCh38 ( Figure 2C). Similarly, reads mapped to the pangenomes had higher alignment scores (Supplementary Figure 5). Mapping to the pangenomes results in a slight drop in mapping confidence, from about 94.9% to 94.1% of reads with a mapping quality greater than 0 (Supplementary Figure 6) in those samples. This is expected as the pangenome contains more sequence than GRCh38, including complex regions and large duplications that are more fully represented, which naturally and correctly reduces mapping confidence for some reads. The same trend is observed when the pangenome is not filtered by frequency (Supplementary Figure 6). We also compared the alignment of long HiFi reads, mapped with GraphAligner 27 . Mapping to the pangenomes results in more long reads mapped fully (i.e. no split mapping) and with high identity ( Figure 2D).

Variant Calling with the HPRC Graphs
We used the short-read alignments to call variants with DeepVariant 28 . To prepare them for DeepVariant, the graph alignments were projected onto GRCh38 using the vg toolkit. Note that, even though the CHM13-based graph did not use GRCh38 as the initial reference, the graph does contain GRCh38. Thus, the CHM13-based graph can also be used in this pipeline.
Both pangenomes constructed with Minigraph-Cactus outperform current top-performing methods ( Figure 2E-F). We note that reads in regions that are falsely duplicated or collapsed in GRCh38 cannot be unambiguously projected from their corrected alleles in CHM13. For this reason, these regions were removed from the benchmark when evaluating the CHM13-based pangenome. Unsurprisingly, the CHM13-based pangenome offers the largest gains in variant calling in challenging regions like those assessed by the Challenging Medically Relevant Genes (CMRG) truth set ( Figure 2E) 29 . Figure 1F shows the precision and recall curves and the CHM13-based pangenome-based variant calls vs state of the art methods based on linear references for the CMRG benchmark. The CHM13-and GRCh38-based pangenomes have F1 scores 0.9830 and 0.9823, respectively, compared to 0.9777 and 0.9756 of Dragen and BWA-MEM DeepVariant, respectively. This gain in F1, though modest, still corresponds to hundreds of variants in these regions ( Figure 2E). The frequency-filtered pangenomes performed better than using the default pangenomes (Supplementary Figure 7). We also tested projecting and calling variants on CHM13. Although the benchmarking protocol is still preliminary for CHM13, we observed a clear improvement when using the pangenome compared to aligning the reads to CHM13 only (Supplementary Figure 10). Some specific regions, including the MHC region and segmental duplications, also have better variant calls on the CHM13-based graph (Supplementary Figure 11).

Structural Variant Genotyping with the HPRC Graphs
PanGenie is a state-of-the art tool for genotyping human structural variation using short reads 30 .
It uses an HMM that combines information from known haplotypes in a pangenome (as represented by phased VCF) along with kmers from short reads in order to infer genotypes and, as such, does not require any read mapping. Minigraph-Cactus can output phased VCF representations of pangenome graphs that can be used as input to PanGenie (see Methods for more details). We evaluated this process by genotyping a cohort of 368 samples from the 1000 Genomes Project 31 (1KG) comprising 20 trios randomly selected from each of the five superpopulations, along with the samples present in the graphs. We repeated this process independently on three different graphs: the GRCh38-based and CHM13-based HPRC pangenomes, as well the v2.0 PanGenie lenient variant set produced by the Human Structural Variation Consortium (HGSVC) 32 . This latter graph was made by constructing reference-based variant calls for each sample, then merging similar variants together into single consensus variants, exactly the process that our pipeline is designed to avoid. The number of variants in each graph is given in Supplementary Table 3.
In order to measure PanGenie's accuracy on each graph, we performed a leave-one-out experiment on five samples from the graphs. For each selected sample, its genotypes and private variants were removed from the VCF, which was then re-genotyped with PanGenie using short reads from that sample. These genotypes were then compared back to those from the original graph, effectively measuring how closely the haplotypes from short-read genotyping correspond to the original, assembly-based haplotypes. Due to their disjoint sample sets, different samples were used for the HPRC (HG00438, HG00733, HG02717, NA20129, HG03453) and HGSVC (HG00731, HG00512, NA19238, NA19650, HG02492). The results are shown in Figure 3A, which shows the weighted genotype concordance 30 across different types of variants, with the Minigraph-Cactus HPRC graphs showing significantly higher accuracy across all SV variant types than the HGSVC. This improvement can be attributed to the higher quality and number (44 vs 32) of the HPRC vs HGSVC assemblies, as well as the more exact representation of variation, SVs in particular, in the multiple alignment-based Minigraph-Cactus graphs, which would explain the increased delta for SV insertions in particular. This more exact representation also explains why the HPRC graph-based genotypes have fewer very common structural variants (AF>20%) (Figure 3B), despite containing significantly more variants ( Figure  3C,D). As with the short read variant calling results, the CHM13-based HPRC graph performs generally better than the GRCh38-based graph (Supplementary

D. Melanogaster Pangenome
We created a Drosophila melanogaster pangenome to demonstrate Minigraph-Cactus's applicability to non-human organisms. We used 16 assemblies including the reference, dm6 (ISO1), 14 geographically diverse strains described in 33 , and one additional strain, B7. Their sizes range from 132 to 144 Mb. The allele frequency filtered graph, used for all mapping experiments, was created by removing nodes appearing in < 2 haplotypes leading to a minimum allele frequency of~12.5% (compared to 10% in the human graph), and was used only for mapping and genotyping, where private variation in the graph is less helpful. The amount of sequence removed by clipping and filtering is shown in Supplementary Figure 16. The relatively small input meant that we could align it with Progressive Cactus using an all-vs-all (star phylogeny) rather than progressive alignment, and the results are included for comparison. In all, we produced five D. melanogaster graphs whose statistics are shown in Supplementary Table 2, a process that took roughly 5 hours for the pangenomes (Supplementary Table 4) and 19 hours for the progressive Cactus alignments (Supplementary Table 5). As in human, adding base-level variants to the SV graph increases its number of nodes and edges by roughly two orders of magnitude. The graph created from the Progressive Cactus alignment has roughly 45% more nodes and edges and over double the total node length (Supplementary Table 2). This is partially explained by the fact that it contains all the sequence filtered out during pangenome construction (Supplementary Figure 16) along with interchromosomal alignments.
The "core" genome size, which we define as the total length of all nodes present in all samples, of the Minigraph-Cactus pangenome is 110 Mb (Supplementary Figure 12, first column), which is roughly half the total size of the graph. This reflects a high diversity among the samples: private transposable element (TE) insertions are known to be abundant in this species 33 . This diversity is also shown in Figure 4A, which graphs the amount of non-reference sequence by the minimum number of samples it is present in, where the private TE insertions would account for much of the nearly 10X differene between the first and second columns. The trend for the number of non-reference nodes is less pronounced (Supplementary Figure 14), which implies that the non-reference sequence is accounted for by larger insertion events and smaller variants tend to be more shared. We used the snarl subgraph decomposition 7 to compute the variant sites within each graph, i.e. subgraphs equivalent to individual SNPs, indels, SVs, etc. Supplementary Figure 15 shows the pattern of nesting of the variant sites in the various graphs.

Short-read Mapping
The Drosophila melanogaster Genetic Reference Panel (DGRP) consists of 205 inbred genomes 34 , unrelated to the 16 strains used to construct the pangenome. We used short reads from this dataset to evaluate mapping performance for our pangenome graph. We selected 100 samples for our evaluation, filtering the dataset to include only samples with a single SRA accession and Illumina sequencing with >15X coverage. We mapped these samples to the allele frequency filtered pangenome graph with vg giraffe in "fast" mode, and to dm6 using BWA-MEM. We counted the number of mapped reads, reads with perfect alignment, and reads with a mapping quality above 0. We found that the number of reads aligning perfectly drastically increased ( Figure 4B), with on average 41.1% of the reads aligning perfectly to the pangenome compared to on average 31.0% when aligning reads with BWA on dm6. As in our results in human presented above, we observe a decrease in the number of reads mapped with a mapping quality above 0 when mapping to the pangenome (80.0% vs 81.1% on average, Figure 4C).

Small Variants
We projected pangenomic mappings to dm6, and used FreeBayes 35 (in the absence of a high quality DeepVariant model) to call variants on these mappings and those from BWA-MEM (see Methods). We then compared the variant calls that were called by both approaches, and those that were called by only one. While variant sites called by both methods showed similar quality scores, there were more sites unique to our pangenomic approach compared to sites found only by mapping reads to the linear dm6 genome. This increase was observed across different quality thresholds (Supplementary Figure 17A,C). Overall, that meant that slightly more variants are called when mapping short reads to the pangenome and projecting them to dm6. For example, on average 740,696 small variants had a quality above 0.1 compared to 738,570 when reads were mapped to the dm6 with BWA-MEM (Supplementary Figure 17B). For genotype quality above 10, 705,320 small variants were called versus 700,385 (Supplementary Figure 17D). We also noticed a lower rate of heterozygous variants called when mapping the reads to the pangenome first (13.2% vs 18.1% on average per sample, Supplementary Figure  18). Due to the high inbreeding of these samples, we expect only a small fraction of variants to truly be segregating 34 .

Structural Variants
The variant sites in the pangenome (snarls) were decomposed into canonical structural variants based on the assembly paths in the pangenome (see Methods). In the pangenome, most of the SVs are rare and supported by one or two assemblies ( Figure 4D). Of note, the known In(3R)C inversion 36 is present in the pangenome, along with 23 other smaller inversions. Structural variants were also genotyped from the short read alignments to the pangenome using vg 5 (see Methods). Even though the genotyping used short reads and the pangenome was frequency-filtered, 47.8% of the SVs in the pangenome were found when genotyping the 100 samples (on the filtered pangenome) with short-read data. Both the full set of SVs in the pangenome and the subset genotyped from the short read data span the full size spectrum of deletions, insertions and a few inversions ( Figure 4E). As expected, SVs that were seen in multiple assemblies in the pangenome tended to have higher allele frequencies in the cohort of 100 samples (Figure 4F). Both rare and more common SVs spanned the full spectrum of SV size and repeat profile, from the shorter simple repeats and satellite variation to the larger transposable element polymorphisms of LTR/Gypsy, LTR/Pao, and LINE/I-Jockey elements, among others (Supplementary Figure 19).

Discussion
The coordinate system provided by the human reference genome assembly has been vital to nearly all research in human genetics, but it can also be a source of bias. This bias can take the form of unmappable reads in the presence of diverse samples 37 or, more subtly, variant calls being skewed towards the reference allele 3,6 . Pangenome graphs have been shown to be effective at reducing reference bias, but their construction has, until now, been limited by trade-offs. Either the graphs needed to be constructed from variant calls against a reference 3,6,32 , and therefore unable to properly represent nested variation while still suffering from some reference bias, or they were limited to only structural variants 16 and unable to effectively be used for short read mapping with current tools 6 . The method we present here overcomes these issues by constructing a pangenome graph directly from a multiple genome alignment that represents nearly all the variation within its inputs.
The challenges of effectively leveraging pangenome graphs for human data do not end at construction. Tooling for analysis, such as read mapping and genotyping, which by definition is more complex for graphs than single reference genomes, is essential. To this end we have ensured that graphs produced with Minigraph-Cactus are the first to be compatible with the majority of state-of-the art pangenome tools (re-engineering the tools as necessary) such as vg 3,6 , Giraffe 3,5,6 , PanGenie 30 and GraphAligner 27 . These tools are all free and open source. Graphs constructed with Minigraph-Cactus are also freely available for download from the Cactus website and through the HPRC 38 .
To demonstrate the usefulness of these graphs and tools, we showed that Illumina and Hifi reads can be mapped with higher identity and fewer split mappings, respectively, to the pangenome than the linear reference. In the former case, the mappings are used to also improve accuracy of short-read variant calling, and we are hopeful that similar gains will be made with long reads when pangenomics tools for variant calling with them are developed. The representation of structural variants in our multiple-alignment based graphs also show considerable improvements in genotyping accuracy when compared to previous methods that rely on merging reference-based calls.
In the case of DeepVariant and PanGenie, the pangenome graph is used in the context of existing reference-based formats such as BAM and VCF. This allows users to augment their existing workflows with pangenomes with minimal changes, which we think will be key to fostering more widespread adoption of pangenomics methods. Still, such projections back to a linear reference can be lossy, especially in complex regions. While GAF is being increasingly adopted as the standard read mapping format for pangenomes, there is no corresponding graph-based alternative to VCF in use that we are aware of, and the necessity of always projecting variants back to VCF for analysis is a bottleneck to reaching the full potential of pangenome graphs. True graph-based genotyping formats and tools are needed.
Minigraph-Cactus requires at least one chromosome-level input assembly in order to be used as a reference backbone and, in general, the quality and usefulness of the pangenome will increase with the quality and completeness of all the input assemblies. We do not think this will be a bottleneck for most species going forward as it will soon be routine to produce large numbers of reference, or even "telomere-to-telmore" quality genomes for many species due to advances in sequencing technology and assembly tools. In the present work, we have quantified the impact of reference genome assembly quality on our pangenomes and their applications. Even though both GRCh38 and CHM13 are included in all HPRC graphs we constructed, the choice of which to use as a reference backbone influences the topology and completeness of the graph, and in virtually all genome-wide measures of mapping, variant calling and genotyping performance, we found the CHM13-based graph to be superior. In the case of variant calling with Giraffe-DeepVariant, we showed that the CHM13-based graph was able to improve upon the state-of-the art accuracy of the GRCh38-based graph, even when making calls on GRCh38. We therefore think our pangenomes could help some users who would otherwise be reluctant to switch to reference assemblies, still take advantage of them.
Building upon previous work in pangenomics, the HPRC has shown that high-quality genome assemblies can be leveraged to provide a better window into structural variation, as well as to reduce bias incurred by relying on a single reference. The pangenome graph representation has been fundamental to this work, but graph construction remains an active research area. The key challenges stem not just from the computational difficulty of multiple genome alignment, particularly in complex regions, but also from fundamental questions about the tradeoffs between complexity and usability. While developing Minigraph-Cactus, we sought a method to construct graphs with as much variation as possible, while still serving as useful inputs for current pangenome tools like vg and PanGenie.
Some of the compromises made to make our method practical represent exciting challenges for future work in both pangenome construction and applications. Pangenomes from Minigraph-Cactus cannot be used, for instance, to study centromeres. The omission of interchromosomal events will likewise preclude useful cancer pangenomes or studies into acrocentric chromosome evolution 39 . We are also interested in ways to remove the necessity of filtering the graph to get optimal mapping performance by using an online method at mapping time to identify a subgraph that most closely relates to the reads of a given sample. Progressive Cactus alignments can be combined and updated and, as data sets become larger, this functionality is becoming more necessary for pangenome alignments. Comprehensive tooling to update pangenomes by adding, removing or updating assemblies is an area of future work.
Pangenomics has its origin in non-human species, and as the assembly data becomes available, we will see pangenomes being produced for a wide array of organisms. Already there is data for a number of species, from tomato 40 to cow 41 . In this work, we constructed a D. melanogaster pangenome as a proof of concept to show that our method can also be used on other non-human organisms. We hope that others will use the Minigraph-Cactus pipeline to produce useful graphs from sets of genome assemblies for their species of interest. Large-scale alignments are resource intensive, and the 90-human pangenomes required nearly three days to compute on a cluster. As such, we've made these alignments publically available through the HPRC and will do the same for future releases.
Reference bias can also affect comparative genomics studies. For example, a genomic region can be of interest to a particular sample, but if that region happens to be missing from the reference genome due intraspecies diversity or assembly errors, it would be absent from any alignments based solely on that reference. Therefore we expect pangenome references to supplant single genome references for intraspecies population genomics studies, we also see this as the future in interspecies comparative genomics studies

HPRC Graph Construction
The HPRC v1.0 graphs discussed here were created by an older version of the pipeline described above, with the main difference being that the satellite sequence was first removed from the input with dna-brnn 42 . This procedure is described in detail in 25 . The amount of sequence removed from the graph, and the reason it was removed, is shown in Supplementary  Figure 2. Roughly 200 Mb per assembly was excluded, the majority of which was flagged as centromeric (HSat2 or alpha satellite) by dna-brnn 42 . The "unassigned", "minigraph-gap" and "clipped" categories denote the sequence that, respectively, did not map well enough to any one chromosome to be assigned to it, intervals > 100kb that did not map with minigraph, and intervals > 10kb that did not align with Cactus. Simply removing all sequence ≥10kb that does not align with Cactus, as described in the methods above, amounts to nearly the same amount of sequence excluded (Supplementary Figure 3). The 10kb threshold was used for clipping because it was sufficient to remove all centromeres (as previously identified) with dna-brnn and also because it corresponds to the maximum length of an alignment that can be computed with abPOA. The exact commands to build HPRC graphs referred to in this figure are available here: https://github.com/ComparativeGenomicsToolkit/cactus/blob/91bdd83728c8cdef8c34243f0a52b 28d85711bcf/doc/pangenome.md#hprc-graph. They were run using the same Cactus commit: 91bdd83728c8cdef8c34243f0a52b28d85711bcf.

Filtering Minigraph Mappings and Chromosome Decomposition
Input contigs were labeled "unassigned" above if they could not be confidently mapped to a single reference chromosome during the Minigraph contig mapping phase of the pipeline. For a given contig, this determination was made by identifying the chromosome in the SV graph to which the highest fraction of its bases mapped with exact matches. If this highest fraction was at least three times higher than the second highest, and greater than or equal to a minimum threshold, the contig was assigned to that chromosome, otherwise it was left unassigned (and omitted from the graph). The minimum threshold for chromosome assignment was 75% for contigs with length ≤ 100 kb, 50% for contigs with length in the range (100 kb, 1 Mb] and 25% for with length > 1 Mb. These values were chosen after empirical experimentation specifically to filter out spurious mappings as determined by VCF-based comparison with HiFi-based DeepVariant calls 25 . Contigs filtered in this way are predominantly centromeric (and can't be confidently mapped anywhere) or small fragments of acrocentric chromosome short arms or segmental duplications without enough flanking sequence to be correctly placed, or regions enriched for putative misjoins (which also occur predominantly within the acrocentric chromosome short arms) 25 . Such filtering is not needed on chromosome level assemblies.
Despite this filtering process, we found a small number of small contigs that, due to either misassembly or misalignment, confidently map across entire chromosome arms (one of the contig maps near the centromere and the other near the telomere). The chromosome arm-spanning edges introduced by such mappings introduce topological complexities that can hinder downstream tools (for example, all variants on the spanned arm would be considered nested within a large deletion). To prevent this, any mapping that would introduce a deletion edge of 10 Mb or more (tunable by a parameter) relative to the reference path is removed. Finally, in rare cases, minigraph can map the same portion of a query contig to different target regions in the graph. When manually inspecting these cases, we found that they could lead to spurious variants in the graph when, as above, compared to variant calls directly from HiFi-based DeepVariant calls (Liao et al., 2022). To mitigate these cases, we remove any aligned query interval (pairwise alignments are represented in terms of the query intervals, positions on the contig, and target intervals, paths within the graph) that overlaps another by at least 25% of its length, and whose mapping quality and/or block length is 5X lower than those of the other interval.

POA-based Cactus Base Aligner
We replaced the base-level alignment refinement (BAR) algorithm that is used to create alignments between the interstitial sequences after the initial anchoring process 20 . Briefly, the original algorithm has two stages. Firstly, from the end of each alignment anchor (termed a block, and defined by a gapless alignment of substrings of the input) it creates a MSA of the unaligned sequences incident with the anchor. Each such MSA has the property that the sequence alignment is pinned from the anchor point, but because of rearrangement, the MSA is not necessarily global, i.e. at the other end of the MSA from the starting anchor point the different sequences may be non-homologous due to genome rearrangement. Secondly, the set of MSAs produced by the first step are refined by a greedy process which seeks to make the set of MSAs, which may overlap in terms of sequence positions, consistent, so resolving, at base-level resolution, the breakpoints of genome rearrangements. For details of this process see the original paper 20 .
The replacement BAR algorithm achieved two things. Firstly, we changed the process in the first step to create MSAs to use the abPOA MSA algorithm 43 . The previous algorithm was based upon the original Pecan MSA process, and scaled quadratically with sequence number, in contrast the new MSA process scales linearly and is overall faster even for small numbers of sequences. In this process we updated abPOA to use the LASTZ default scoring parameters 12 , with the addition of a "long" gap state not used by LASTZ but included within abPOA. Gap parameters were thus: short-gap-open: 400, short-gap-extend: 30, long-gap-open:1200, long-gap-extend:1. Parameters for the long-gap-state were determined by empirical experimentation. Secondly, we fully reimplemented the second step of the BAR algorithm, making it both faster and removing various unnecessary bottlenecks which previously scaled superlinearly but which now all scale linearly with sequence number and length. Importantly, this process did not materially affect the resulting alignments, as judged by extensive unit-and system-level testing.

Conversion from Multiple Alignment to Sequence Graph
Cactus natively uses Hierarchical Alignment (HAL) format 21 . We developed hal2vg, which converts HAL files to vg formats. It works for both Progressive and Minigraph-Cactus. It works in memory and, for large alignments, is reliant on having chromosomal decomposition of the HAL and simple topology to run efficiently. hal2vg begins by visiting the pairwise alignments in breadth-first order from the root of the underlying guide tree. Contiguous runs of exact matches in the pairwise are "pinched" together to form nodes of a sequence graph using Cactus 15 , and the assemblies themselves are added as "threads" to this graph. SNPs are stored in an auxiliary data structure and used to pinch together transitive exact matches as they arise. For example, if the pairwise alignments of a column (in the multiple alignment) are A->C and C->A, this structure will ensure that the two A's are pinched together in the sequence graph (which, by definition, only represents exact matches within its nodes). Seqwish 44 is a recent tool that also induces sequence graphs from sets of pairwise alignments but, because it does not transitively process SNPs in this way, will not work on tree-based sets of pairwise alignments as represented by HAL. Finally, once the sequence graph has been created in memory, it is serialized to disk, path by path, using libbdsg 45 , an API for reading and writing sequence graphs in an efficient, VG-compatible binary format.
Conversion from Sequence Graph to VCF By default, all graphs are output in GFA (v1.1), as well as the vg-native indexes: xg, snarls and GBWT formats 45,46 . Since VCF remains more widely-supported than these formats, we implemented a VCF exporter in vg (vg deconstruct) that is run as part of the Minigraph-Cacatus pipeline. It outputs a site for each snarl in the graph. It uses the haplotype index (GBWT) to enumerate all haplotypes that traverse the site, which allows it to compute phased genotypes. For each allele, the corresponding path through the graph is stored in the AT (Allele Traversal) tag. Snarls can be nested, and this information is specified in the LV (Level) and PS (Parent Snarl) tags, which needs to be taken into account when interpreting the VCF. Any phasing information in the input assemblies is preserved in the VCF.

HPRC Graph Mapping and Variant Calling
We used 30x Illumina NovaSeq PCR-free short read data HG001, HG002, and HG005, available at gs://deepvariant/benchmarking/fastq/wgs_pcr_free/30x/. The reads were mapped to the pangenome using vg giraffe (v1.37.0). The same reads were mapped to GRCh38 with decoy sequences, but no ALTs using BWA-MEM (v0.7.17). To provide additional baselines, reads were also mapped with vg giraffe to linear pangenomes, i.e. pangenomes containing only the reference genome (GRCh38 or CHM13). The number of reads mapped with different mapping quality (or aligning perfectly) were extracted from the graph alignment file (GAF/GAM files) produced by vg giraffe and from the BAM files produced by BWA-MEM.

Evaluation of small variant calls
Calls on GRCh38 were evaluated as in 25  When evaluating calls made against the GRCh38 chromosomal paths using the CHM13-based pangenome, we excluded regions annotated as false-duplications and collapsed in GRCh38. These regions do not have a well-defined truth label in the context of CHM13. We used the "GRCh38_collapsed_duplication_FP_regions", "GRCh38_false_duplications_correct_copy", "GRCh38_false_duplications_incorrect_copy", and "GRCh38_population_CNV_FP_regions" region sets available at https://github.com/genome-in-a-bottle/genome-stratifications.
To evaluate the calls made on CHM13 v1.1, we used two approaches. First, the calls from CHM13 v1.1 were lifted to GRCh38 and evaluated using the GRCh38 truth sets described above (GIAB v4.2.1 and CMRG v1.0). For this evaluation, we also lifted these GRCh38-based truth sets to CHM13 v1.1 to identify which variants of the truth set are not visible on CHM13 because they are homozygous for the CHM13 reference allele. Indeed, being homozygous for the reference allele, those calls will not be present in the VCF because there are no alternate alleles to find. These variants were excluded from the truth set during evaluation. The second approach was to evaluate the calls in CHM13 v1.1 directly. To be able to use the CMRG v1.0 truth set provided by the GIAB, we lifted the variants and confident regions from CHM13 v1.0 to CHM13 v1.1. The CMRG v1.0 truth set focuses on challenging regions, but still provides variant calls across the whole genome. Hence, we used those variants to evaluate the performance genome-wide although restricting to a set of confident regions constructed by intersecting the confident regions for HG002 from GIAB v4.2.1 (lifted from GRCh38 to CHM13 v1.1), and the alignment regions produced by dipcall in the making of the CMRG v1.0 truth set (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA2 4385_son/CMRG_v1.00/CHM13v1.0/SupplementaryFiles/HG002v11-align2-CHM13v1.0/HG002 v11-align2-CHM13v1.0.dip.bed). Finally, we used the preliminary HG002 truth set from GIAB on CHM13 v2.0 which is equivalent to CHM13 v1.1 with the added chromosome Y from HG002. The calls in this set wer based on aligning a high-confidence assembly using dipcall 50 (labeled in figure as "dipcall CHM13 v2.0"). Here again, we intersected the confident regions with the GIAB v4.2.1 confident regions lifted from GRCh38 to CHM13.
Finally, we compared in greater detail the calling performance using the GRCh38-based and CHM13-based pangenomes by stratifying the evaluation across genomic region sets provided by the GIAB (https://github.com/genome-in-a-bottle/genome-stratifications The reads were then aligned to the pangenomes (after being converted to fastq with samtools fastq in the case of HG004) using GraphAligner (v.0.13) with '-x vg`with .gam output.. We parsed the GAM output to extract the first record as primary alignment. By overlapping the other alignment records with the primary alignment, we identified reads with split-mapping, i.e. with part of the read is mapped to a different location from the primary alignment. The alignment identity is reported by GraphAligner and was also extracted from the GAM.

SV Genotyping with PanGenie
Variants corresponding to nested sites in the HPGRC graph-derived VCFs were decomposed as described in 25 before running PanGenie version v2.1.0 with its default parameters. The HGSVC v.4.0 "lenient set" 32 was also included, but did not require decomposition. These three VCFs, annotated with all computed genotypes, are available for download here: https://zenodo.org/record/7669083. The genotyped samples were chosen by randomly selecting 100 trios for the 1KG data, 20 from each superpopulation. Samples present in HPRC and HGSVC were also included, for a total of 368. High coverage short reads from the 1KG 31 were used for genotyping. The leave-one-out experiments were performed as described in 25 and, like in that work, variants were "collapsed" using truvari collapse -r 500 -p 0.95 -P 0.95 -s 50 -S 100000 from Truvari 53 version 3.5.0 when comparing counts of genotyped variants (Figure 3 B,C,D). This is because near-identical insertions in the graph become completely separate variants in the VCF when, for the purposes of this comparison, we wish to treat them the same. SV deletions (insertions) were sites with reference alleles of length >= 50 (1) and alt alleles of length 1 (>=50). Sites that did not meet this criteria but had a reference or alt allele of length >= 50 were classified as "SV Other".

D. Melanogaster Graph Construction
The D. Melanogaster pangenome was created using Minigraph-Cactus using the procedure described in The Minigraph-Cactus Pangenome Pipeline secion . Progressive Cactus was run on the same input (which implies a star phylogeny) and was exported to vg with hal2vg.

D. Melanogaster Variant Decomposition
The variant sites in the pangenome (snarls, aka bubbles) were decomposed into canonical structural variants using a script developed for the HPRC analysis 25 . In brief, each allele in the deconstructed VCF specifies the corresponding path in the pangenome. The script follows these paths and, comparing them with the dm6 reference path, enumerates each canonical variant (SNP, indels, structural variants). The frequency of each variant in the pangenome corresponds to the number of assemblies that traverse their paths.

D. Melanogaster Graph Mapping and Variant Calling
The DGPR samples used are listed in Supplementary Table 6. Short reads were obtained using fasterq-dump -split 3 on the accessions in the last column of this table. Each read pair was mapped to the allele-frequency filtered graph with vg giraffe and to dm6 with BWA-MEM.
vg call was used to to genotype variants in the pangenome. For each sample, these variant calls were decomposed into canonical SVs using the same approach described above on the HPRC deconstructed VCF. The SV calls were then compared to the SVs in the pangenome using the sveval package 5 which matches SVs based on their types, sizes and location.Because SVs are genotyped using the same pangenome, they are expected to be relatively similar, and we can use standard "collapse" criteria to cluster them in SV sites. Two SVs were matched if: their regions had a reciprocal overlap of at least 90% for deletions and inversions; they were located at less than 100bp from each other, and their inserted sequences were at least 90% similar for insertions. The same approach was used to cluster the SVs alleles into the SV sites reported in the text and figures. The SV alleles were annotated with RepeatMasker (v4.0.9). We assigned a repeat class to a SV if more than 80% of the allelic sequence was annotated as such. The 80% threshold was chosen by inspecting the distribution and observing a negligible number of events below this value.
We used vg surject to produce BAM files referenced on dm6 from the mappings to the pangenome, and FreeBayes v1.3.6 35 (in the absence of a high quality DeepVariant model) to call variants on these mappings and those from BWA-MEM. Single-sample VCFs were merged with bcftools merge.
To compare the variant calls by both approaches, we used bcftools 54 (v1.10.2) to normalize the VCFs (bcftools norm), and compare them (bcftools isec) to mark variant sites where both approaches call a variant, and sites where only one approach does. We compared the number of calls in each category, across samples, and for different minimum variant quality thresholds (QUAL field or genotype quality GQ field).